Hardware-Software Codesign Course Overview

UNIT 3 ES NOTES

Uploaded by Mayank Singh

Hardware-Software Codesign

Outline

• Introduction to Hardware-Software Codesign
• System Modeling, Architectures, Languages
• System Partitioning
• Co-synthesis Techniques
• Function-Architecture Codesign Paradigm
• Case Study
  – ATM Virtual Private Network Server

This part of the course will be given in 6+1 weeks. The outline is as follows: First, we'll briefly introduce what HW-SW codesign is. Next, we'll talk about the concepts of the models, architectures, and languages that are mainly used to specify the system we want to design. Then, we'll talk about how to partition the system and decide which parts will run in SW and which will be implemented as ASICs. After partitioning, the SW and HW parts of the system have to be developed, and we'll present some co-synthesis techniques that can accelerate the development of HW and SW and their integration. To give an overall view of the codesign process, we'll introduce one of the most famous academic codesign flows, developed at UC Berkeley, called "function-architecture codesign". In the end, we will present two case studies: ATM VPN (from CODES'2001) and Digital Camera SoC (from Vahid's Chapter 7).

There will be several labs that familiarize students with HDL simulators such as ModelSim, the Mentor Graphics Seamless Coverification Environment, and other Cadence synthesis/simulation tools. We will have one term project based on the Digital Camera SoC found in Chapter 7 of Frank Vahid's book on Embedded System Design.

Classic Hardware/Software Design Process

• Basic features of current process:
  – System immediately partitioned into hardware and software components
  – Hardware and software developed separately
• Implications of these features:
  – HW/SW trade-offs restricted
    • Impact of HW and SW on each other cannot be assessed easily
  – Late system integration
• Consequences of these features:
  – Poor quality designs
  – Costly modifications
  – Schedule slippages

Codesign Definition and Key Concepts

• Co-design
  – The meeting of system-level objectives by exploiting the trade-offs between hardware and software in a system through their concurrent design
• Key concepts
  – Concurrent: hardware and software developed at the same time on parallel paths
  – Integrated: interaction between hardware and software developments to produce designs that meet performance criteria and functional specifications

(Figure: the classic design flow and the co-design flow, each ending in HW and SW components.)

The classical system design process partitions system requirements into hardware and software, which are developed separately by different teams. Their designs are integrated only at the end of the design process. When hardware faults are discovered, software must try to compensate for them, because hardware is difficult to change (compared with the ease of software changes). In addition, since HW and SW are developed separately, the effects of their interaction and integration cannot be explored until the last phase. This often results in suboptimal designs, costly modifications, delays to market, schedule slippage, etc. The main problem here is that faults are found TOO late!

The common definitions for HW/SW codesign are presented above. The two key concepts involved in codesign are concurrent development of HW and SW, and integrated design. Integrated design allows interaction between the design of HW and SW and helps to make trade-offs between them. Codesign techniques using these two key concepts take advantage of design flexibility to create systems that can meet stringent performance requirements with a shorter design cycle.

Since the main goal of codesign is to accelerate the design of systems that consist of both hardware and software, any technique that helps shorten the design time can be categorized as a codesign issue. In general, it is preferred that the clear separation point between HW and SW be deferred as late in the design process as possible.

Motivations for Codesign

• Co-design helps meet time-to-market because developed software can be verified much earlier.
• Co-design improves overall system performance, reliability, and cost effectiveness because defects found in hardware can be corrected before tape-out.
• Co-design benefits the design of embedded systems and SoCs, which need HW/SW tailored for a particular application.
  – Faster integration: reduced design time and cost
  – Better integration: lower cost and better performance
  – Verified integration: fewer errors and re-spins

Driving Factors for Codesign

• Reusable components
  – Instruction Set Processors
  – Embedded Software Components
  – Silicon Intellectual Properties
• Hardware-software trade-offs more feasible
  – Reconfigurable hardware (FPGA, CPLD)
  – Configurable processors (Tensilica, ARM, etc.)
• Transaction-Level Design and Verification
  – Peripheral and Bus Transactors (Bus Interface Models)
  – Transaction-level synthesis and verification tools
• Multi-million gate capacity in a single chip
• Software-rich chip systems
• Growing functional complexity
• Advances in computer-aided tools and technologies
  – Efficient C compilers for embedded processors
  – Efficient hardware synthesis capability

Codesign has the following advantages:
(1) Shorter time to market, because software debug can be performed much earlier.
(2) More optimal designs, because defects can be found and repaired earlier without having to compensate for others' mistakes.
(3) Faster, better, verified integration, because of better design space exploration.

Co-design is especially important for the design of embedded systems or SoCs, because not only do most of them require both HW and SW by nature, but they are also often designed under cost and performance constraints that codesign can help to meet.

There are several driving factors for the codesign methodology, as listed in the slide. Mainly, systems are getting more and more complex in terms of both functionality and gate count, so a system is no longer designed from scratch; instead, design is now more an integration of reusable components, such as hardware IP and software COTS. Further, thanks to FPGAs, hardware has become as flexible as software used to be. All this flexibility has created the need for a methodology with wider coverage and fewer design faults, such as the codesign technique. In particular, CAD tools and techniques have made good progress: mature and efficient tools for software generation (through compilers) and hardware generation (through high-level synthesis) are available nowadays. How to advance the overall system design by considering HW and SW together has become a very challenging and interesting topic.

Categories of Codesign Problems

• Codesign of embedded systems
  – Usually consist of sensors, controllers, and actuators
  – Are reactive systems
  – Usually have real-time constraints
  – Usually have dependability constraints
• Codesign of ISAs
  – Application-specific instruction set processors (ASIPs)
  – Compiler and hardware optimization and trade-offs
• Codesign of reconfigurable systems
  – Systems that can be personalized after manufacture for a specific application
  – Reconfiguration can be accomplished before execution or concurrently with execution (called evolvable systems)

Typical Codesign Process

(Flow diagram:)
  System Description (Functional) — captured with FSMs, directed graphs, concurrent processes, programming languages
    ↓
  HW/SW Partitioning — on a unified representation (data/control flow)
    ↓            ↑ (another HW/SW partition can be tried)
  Software Synthesis | Interface Synthesis | Hardware Synthesis
    ↓
  System Integration — instruction-set-level HW/SW evaluation

Listed are some of the ways in which the codesign methodology can be applied to different kinds of systems. The first category is the embedded system, which is what we will mainly focus on in this course. Such a system consists of a microcontroller and some hardware components, which involve both SW and HW issues. Besides, they are often subject to real-time constraints. What the popular SoC targets are is precisely various embedded system applications. The second category to which system codesign often applies is application-specific instruction set processors. This kind of processor does not have a fixed architecture and instruction set; instead, according to the application requirements, a CPU with a specific instruction set can be developed. Since the instruction set is not fixed, how to generate the software and estimate the performance becomes very important. The third category listed here is reconfigurable systems. These systems provide reconfiguration capability and can be configured for different applications. How to generate the software that configures these systems is one of the codesign issues.

A typical codesign process starts with the system description in some kind of executable language such as SystemC or a combination of HDLs (hardware description languages) such as Verilog/VHDL and HLLs (high-level languages) such as C/C++/Java. Partitioning is then performed on the functional specification by using profiling tools or other estimation techniques that tell a designer what percentage of time is spent executing which functions and where the major bottlenecks or critical paths of the system are. Except for some regular systems such as DSPs, HW/SW partitioning is often not easily automated and needs human intervention, due to the variety of functional applications. In addition, the search for an optimal HW/SW partitioning solution is an NP-hard problem, so heuristic algorithms are required in practice. Once a decision is made as to what functions are to be implemented in hardware and what in software, they are designed simultaneously by different design teams, but they must use a common medium of verification, such as a cosimulation environment, in which system errors that result from hardware and software interactions can be detected and fixed. The interface between HW and SW also has to be implemented. Finally, the whole system is integrated at a lower instruction-set level. If the current HW/SW partition cannot meet the constraints of the system, another iteration of the design partition can be tried. Different partitionings can be synthesized to perform a guided exploration of the design space.
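The notes point out that optimal HW/SW partitioning is NP-hard, so heuristics are used in practice. A minimal greedy sketch in C++ (the profile numbers and the speedup-per-gate cost model are illustrative assumptions, not from the slides):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One candidate function from the profiled specification (illustrative).
struct Task {
    std::string name;
    double sw_time;   // estimated execution time in software (ms)
    double hw_time;   // estimated execution time as a hardware block (ms)
    double hw_area;   // silicon cost of a hardware implementation (gates)
};

// Greedy heuristic: move the tasks with the best speedup-per-gate to
// hardware until the area budget is exhausted. This is only a sketch of
// the kind of heuristic the notes mention, not an optimal partitioner.
std::vector<std::string> partition(std::vector<Task> tasks, double area_budget) {
    std::sort(tasks.begin(), tasks.end(), [](const Task& a, const Task& b) {
        return (a.sw_time - a.hw_time) / a.hw_area >
               (b.sw_time - b.hw_time) / b.hw_area;
    });
    std::vector<std::string> in_hw;
    for (const Task& t : tasks) {
        if (t.hw_area <= area_budget && t.hw_time < t.sw_time) {
            in_hw.push_back(t.name);   // assign this task to hardware
            area_budget -= t.hw_area;
        }
    }
    return in_hw;   // everything not listed stays in software
}
```

For example, with hypothetical profiled tasks dct (9 ms SW / 1 ms HW / 4000 gates), huffman (3 / 2 / 3000) and ui (1 / 0.9 / 5000) and an 8000-gate budget, the heuristic moves dct and huffman to hardware and leaves ui in software.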

Codesign Process

• System specification
  – Models, Architectures, Languages
• HW/SW partitioning
  – Architectural assumptions:
    • Type of processor, interface style, etc.
  – Partitioning objectives:
    • Speedup, latency requirements, silicon size, cost, etc.
  – Partitioning strategies:
    • High-level partitioning by hand, computer-aided partitioning techniques, etc.
  – HW/SW estimation methods
• HW/SW synthesis
  – Operation scheduling in hardware
  – Instruction scheduling in compilers
  – Process scheduling in operating systems
  – Interface synthesis
  – Refinement of specification

Requirements for the Ideal Codesign Environment

• Unified, unbiased hardware/software representation
  – Supports uniform design and analysis techniques for hardware and software
  – Permits system evaluation in an integrated design environment
  – Allows easy migration of system tasks to either hardware or software
• Iterative partitioning techniques
  – Allow several different designs (HW/SW partitions) to be evaluated
  – Aid in determining the best implementation for a system
  – Partitioning applied to modules to best meet design criteria (functionality and performance goals)
• Integrated modeling substrate
  – Supports evaluation at several stages of the design process
  – Supports step-wise development and integration of hardware and software
• Validation methodology
  – Ensures that the implemented system meets the initial system requirements

Some details of what the codesign process involves are listed in this slide. (The main concepts were discussed in the previous slide.) The first step in designing a system is specifying its functionality, which can be described in suitable languages. However, the features of the system's functionality can best be captured by constructing an appropriate abstraction model of the system. The next design step maps the model of the system onto some underlying architecture. Here, the architecture describes in what style the target system will be implemented. The mapping of the functionality to the architecture is decided by partitioning. To proceed with partitioning, we have to decide what the objectives of the partitioning are, as well as the partitioning strategies. Besides, some estimation methods have to be applied to evaluate the partitioning results. After the partitioning decisions have been made, the HW and SW parts of the system have to be generated respectively. The details of HW and SW synthesis involve the scheduling of operations or instructions, the generation of the interface between HW and SW, etc. The HW/SW partitioning result also has to be back-annotated to refine the original specification.

There are several requirements for an ideal codesign environment, including a specification language that is not biased towards either hardware or software, such as SystemC, UML, or some other more generic modeling technique such as Petri nets, dataflow diagrams, etc. Further, the codesign environment should allow a designer to iteratively change the hardware/software partitions for efficient design space exploration.

Hardware and software must be cosimulated and coverified in a common modeling substrate such as the Mentor Graphics Seamless CVE (coverification environment).

Models, Architectures, Languages

Models & Architectures

(Figure: Specification + Constraints → Design Process → Implementation; the specification side corresponds to Models, the implementation side to Architectures.)

• Models are conceptual views of the system's functionality
• Architectures are abstract views of the system's implementation

We'll go into the details of the codesign process by first introducing the concepts of models, architectures and the languages which are used to describe the system. Models are specifications, and architectures are implementations of the specification through a refinement process. Languages help in this design process by making things executable, simulatable, verifiable, and thus less error-prone.

The system design process starts from the specification of the system functionality with some given design constraints, and then follows the design flow through a series of design steps to achieve the final implementation of the system. Here the starting point of system design is the system specification, and models provide conceptual views of the system's functionality at different design phases. A model can be defined as a formal system consisting of objects and composition rules, and is used for describing a system's characteristics. Typically we use a particular model to decompose a system into pieces, and then generate a specification by describing these pieces in a particular language. A language can capture many different models, and a model can be captured in many different languages. Designers can choose different models during different design phases to emphasize different aspects of the system. For example, during the specification phase, the designer knows nothing beyond the functionality of the system, so he tends to use a model that does not reflect any implementation details, while during the implementation phase, the designer can switch to a model that captures the system's structure.

A model describes exactly how the system works, but does not tell how the system is to be implemented. Therefore, the model finally has to be mapped onto some architecture, which serves to define the model's implementation by specifying the hardware components as well as the connections between them. In other words, architectures are abstract views of the system's implementation.

Models and architectures are the conceptual and implementation views at the highest level of abstraction. The design process is the set of design tasks that transforms a model into an architecture.
Models of an Elevator Controller

Architectures Implementing the Elevator Controller

One example of models, architectures, and languages can be illustrated with an elevator controller. Here the target elevator system can be described in English, as shown in figure (a). A natural language such as English is not a good specification language, as it is ambiguous and not executable. Figures (b) and (c) show two different models of the elevator controller: fig (b) represents the controller as a set of programming statements, while fig (c) represents it as a state machine. Both models consist of a set of objects and the interactions among them. The state-machine model consists of a set of states and transitions between states, and is best suited to represent a system's temporal behavior, as it allows a designer to explicitly express the modes and the mode transitions caused by external or internal events. The algorithmic model consists of a set of statements that are executed under a control sequence that uses branching and looping. Although this model cannot represent the concept of explicit states, it can specify a system's input-output relationship in terms of a sequence of statements, and is best suited to represent the procedural view of the system.

As described before, the models of the system are finally mapped onto some underlying architecture for implementation. This slide shows two alternative architectures, at different abstraction levels, for the elevator controller. Figure (a) is a register-level implementation which uses a state register to hold the current state and combinational logic to implement the state transitions and the values of the output signals. In figure (b) the system is implemented by a microcontroller, where a variable in a program represents the current state and statements in the program calculate the state transitions and the values of the output signals. In this architecture, the program is stored in memory and executed by the processor.

These two architectures represent two entirely different implementation approaches. Selecting a different target system architecture at the beginning of the design phase will surely affect the subsequent design process.
HW/SW System Models

• State-Oriented Models
  – Finite-State Machines (FSM), Petri Nets (PN), Hierarchical Concurrent FSM
• Activity-Oriented Models
  – Data Flow Graph, Flow Chart
• Structure-Oriented Models
  – Block Diagram, RT netlist, Gate netlist
• Data-Oriented Models
  – Entity-Relationship Diagram, Jackson's Diagram
• Heterogeneous Models
  – UML (OO), CDFG, PSM, Queuing Model, Programming Language Paradigm, Structure Chart

State-Oriented: Finite State Machine (Moore Model, Mealy Model)

There are numerous methods that are candidates for a unified representation. These models can be classified into five distinct categories. Almost all of them have been tried in one codesign system or another, with varying levels of success. Typically a method is better suited to systems of a certain type; e.g., data flow diagrams are more suited to data-driven applications like Digital Signal Processing (DSP) systems.

A state-oriented model represents the system as a set of states and a set of transitions between them, which are triggered by external events. This model is most suitable for control systems, where the system's temporal behavior is the most important aspect of the design.

An activity-oriented model describes the system as a set of activities related by data or execution dependencies. This model is suitable for transformational systems such as DSP systems.

A structure-oriented model is usually used to describe a system's physical modules and the interconnections between them. This model is usually used for representing the system's architecture.

A data-oriented model emphasizes the data structures used in the system. It represents the system as a collection of data related by their attributes, class membership, etc. This model is most suitable for information systems, such as databases, where the function of the system is less important than the data organization.

A finite-state machine is an example of a state-oriented model and also the most popular model for describing control systems, since the temporal behavior of such systems is most naturally represented in the form of states and transitions between states. The transitions between states are triggered by external events such as clocks, primary inputs, etc. There are in general two forms of the FSM, which differ in how the output of the system is produced. In a (transition-based) Mealy FSM, the outputs of the system depend on both the state and the input values, while in a (state-based) Moore FSM, the outputs of the system depend only on the states. This slide represents the same elevator controller function in the state-based Moore FSM model. In principle, Moore and Mealy state machines can be translated into each other. However, the Moore model will require more states than the Mealy model to represent the same functionality.

In addition to the simple Mealy and Moore FSMs, there is another important extended FSM, called the FSM with a datapath (FSMD), where the events influencing the transitions between the states can be represented as a set of status expressions (logic relations). Besides, the output values can also be described as expressions or equations. The salient feature of the FSMD is that it is suitable for both control- and computation-dominated systems.

The merits and demerits of the FSM can be summarized as follows:

Merits
– Represents a system's temporal behavior explicitly
– Suitable for control-dominated systems
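As a concrete illustration of the Mealy property described above (outputs depend on both the current state and the input), the elevator controller can be sketched in a few lines of C++. The three-floor behavior, the jump-to-target transition, and the up/down/stay outputs are illustrative assumptions, not the slide's exact machine:

```cpp
#include <cassert>
#include <string>

// Mealy-style model of a three-floor elevator controller: the state is
// the current floor, the input is the requested floor, and the output
// (up / down / stay) depends on both state and input -- the Mealy
// property described in the notes. Movement is abstracted to a single
// transition per request, as a specification model would do.
class ElevatorFSM {
public:
    explicit ElevatorFSM(int start_floor = 1) : floor_(start_floor) {}

    // One transition: consume an input, emit an output, update the state.
    std::string request(int target) {
        std::string out = (target > floor_) ? "up"
                        : (target < floor_) ? "down"
                        : "stay";
        floor_ = target;   // next state
        return out;
    }

    int floor() const { return floor_; }

private:
    int floor_;  // current state
};
```

A Moore version of the same controller would attach the outputs to the states themselves (e.g. separate "moving up to 3" and "idle at 3" states), which is exactly why the notes say the Moore model needs more states for the same functionality.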
Codesign Finite State Machine (CFSM)

• Globally Asynchronous, Locally Synchronous (GALS) model

(Figure: three locally synchronous FSMs — CFSM1, CFSM2, CFSM3 — running concurrently and interacting through asynchronous events such as C=>A, C=>B, C=>F, C=>G, B=>C, (A==0)=>B, and F^(G==1).)

State-Oriented: Petri Nets

The conventional FSM is not suitable for describing complex systems, as it cannot explicitly support concurrency and hierarchy. Because of this, the FSM representation may lead to the problem of state explosion for complex systems. This slide shows a variant of the FSM called the codesign finite state machine, one of the GALS models. This example consists of three locally synchronous FSMs which run concurrently. The interaction between these three FSMs goes through asynchronous communication channels. The communication channels act like external events for the local FSMs.

The Petri net model is another type of state-oriented model, defined to model systems that comprise interacting concurrent tasks. A Petri net consists of a set of places, a set of transitions, and a set of tokens. Tokens reside in places and circulate through the Petri net by being consumed and produced whenever a transition fires. A transition can fire only if each of its input places has at least one token. When a transition fires, all of its enabling tokens are removed from the input places, and one token is deposited into each output place. This slide shows several system characteristics that Petri nets can model. Fig (a) shows the model of "sequencing", where transition t1 fires after t2. Fig (b) shows the modeling of "non-deterministic branching", as only one of t1 or t2 will fire. Fig (c) models "synchronization", as t1 can fire only after both input places have tokens. Fig (d) represents the concept of "resource contention", where t1 and t2 compete for the same token, which resides in the center place. Fig (e) models the concept of "concurrency", as transitions like t1 and t4 can fire at the same time.

In addition to being able to represent several key system characteristics, Petri net models can be used to check and validate certain useful system properties such as safeness and liveness. Safeness guarantees that the number of tokens in the net will not grow indefinitely. Liveness guarantees that a deadlock will not happen in the system.

The merits and demerits of Petri nets can be summarized as follows:

Merits
– Good at modeling and analyzing concurrent systems
– Extensive theoretical and experimental works
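The firing rule described above can be captured directly in code. A minimal sketch in C++ (the place/transition encoding by index is an illustrative assumption):

```cpp
#include <cassert>
#include <vector>

// Minimal Petri net following the firing rule in the notes: a transition
// is enabled iff every one of its input places holds at least one token;
// firing consumes one token per input place and deposits one token into
// each output place.
struct Transition {
    std::vector<int> inputs;   // indices of input places
    std::vector<int> outputs;  // indices of output places
};

struct PetriNet {
    std::vector<int> tokens;              // marking: tokens per place
    std::vector<Transition> transitions;

    bool enabled(int t) const {
        for (int p : transitions[t].inputs)
            if (tokens[p] < 1) return false;
        return true;
    }

    bool fire(int t) {
        if (!enabled(t)) return false;
        for (int p : transitions[t].inputs)  --tokens[p];  // consume
        for (int p : transitions[t].outputs) ++tokens[p];  // produce
        return true;
    }
};
```

For the "synchronization" pattern of Fig (c), a transition with two input places stays disabled until both places hold a token, and firing it moves the computation into the single output place.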
Activity-Oriented: Data Flow Graphs (DFG)

Heterogeneous: Control/Data Flow Graphs (CDFG)

• Graphs contain nodes corresponding to operations in either hardware or software
• Often used in high-level hardware synthesis
• Can easily model data flow, control steps, and concurrent operations because of its graphical nature

Example (figure): inputs 5, X, 4, Y feed two additions in control step 1; their results feed an addition in control step 2, which feeds a final addition in control step 3.

So far we have introduced several models which are basically suitable for describing how the system state changes in response to external events. This state-transition characteristic is inherent in most control systems. However, transformational systems, where the outputs are determined by a set of computations on the system's inputs, are hard to capture using state-oriented models. Instead, they are captured by a so-called "data flow graph" model. A DFG consists of a set of nodes and a set of edges. There are various types of nodes, including input and output nodes and activity nodes, which represent the computations that transform or manipulate data. The nodes are connected by directed edges, which represent the data dependences between the nodes. In the example shown in the slide, the system consists of two activities A1 and A2; the latter can be further decomposed into A21, A22, and A23. Here, after A1 has been computed and the data Y has been produced, activity A2 can start.

The merits and demerits of the DFG can be summarized as follows:

Merits
– Supports hierarchy
– Suitable for specifying complex transformational systems
– Represents problem-inherent data dependencies

Demerits
– Does not express control sequencing or temporal behaviors
– Weak for modeling embedded systems

Since most systems exhibit several characteristics, a single-model representation very often cannot capture the system well. Therefore, some traditional models can be modified or combined into a heterogeneous model. This figure presents one such example, the so-called control/data flow graph (CDFG), which is able to explicitly show both the data dependences and the control sequence of a system. In the example shown in this slide, the result of (5+X) is scheduled to be computed at control step 1, and its result is used in another addition node scheduled at step 2. This model is used in many high-level specification tools such as Ptolemy, SES Workbench, etc.

[Kumar95]
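The control-step assignment in the CDFG example corresponds to ASAP (as-soon-as-possible) scheduling of the data-flow graph. A minimal sketch in C++ (assuming, for simplicity, that nodes are listed in topological order and every operation takes one control step):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One operation node of a data-flow graph.
struct Node {
    std::vector<int> preds;  // indices of nodes whose results it consumes
};

// ASAP scheduling: an operation can execute one control step after the
// latest of its predecessors; nodes with no predecessors go in step 1.
std::vector<int> asap(const std::vector<Node>& dfg) {
    std::vector<int> step(dfg.size(), 1);
    for (int i = 0; i < static_cast<int>(dfg.size()); ++i)  // topological order
        for (int p : dfg[i].preds)
            step[i] = std::max(step[i], step[p] + 1);
    return step;
}
```

For the slide's example — two independent additions (5+X and 4+Y) feeding a third addition, which feeds a fourth — this assigns control steps 1, 1, 2, 3, matching the figure.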
Heterogeneous: Object-Oriented Paradigms (UML, …)

• Use techniques previously applied to software to manage complexity and change in hardware modeling
• Use OO concepts such as
  – Data abstraction
  – Information hiding
  – Inheritance
• Use a building-block approach to gain OO benefits
  – Higher component reuse
  – Lower design cost
  – Faster system design process
  – Increased reliability

Object-Oriented Representation

Example — 3 levels of abstraction:
  – Register: Read, Write
  – ALU: Add, Sub, Mult, Div, AND, Shift
  – Processor: Load, Store

As systems become more and more complex, the conventional models discussed in the previous slides become somewhat inadequate for capturing complex system functionality. Therefore, some modeling techniques previously used in software engineering have now been applied to hardware modeling. One such key modeling technique is the so-called object-oriented paradigm. The salient features of the object-oriented representation are that it supports data abstraction, information hiding, and inheritance. The detailed manipulation of data is hidden from the outside, and each OO building block interacts with the others through abstract channels. In SoC design, the concept of IP (intellectual property) or component reuse has become a very popular practice. Through OO modeling, the feature of component reuse can be easily captured.

In an object-oriented representation, each OO block can support any level of abstraction. This is a simple example of three different levels of hardware abstraction that can be described in an OO representation.

[Kumar95]
Characteristics of Conceptual Models

• Concurrency
  – Data-driven concurrency
  – Control-driven concurrency
• State transitions
• Hierarchy
  – Structural hierarchy
  – Behavioral hierarchy
• Programming constructs
• Behavior completion
• Communication
• Synchronization
• Exception handling
• Non-determinism
• Timing

This slide shows the application of UML across the overall design phases of SoC design.

Specification languages are used to specify the system characteristics. From the description of the conceptual models, it can be seen that different conceptual models possess different characteristics. This slide presents some of the characteristics most commonly found in conceptual models. In general, it is preferred that more of these characteristics be captured in a single specification language.

For the design of embedded systems, the specification requirements include state transitions, behavioral hierarchy, concurrency, exceptions, programming constructs, and behavior completion.
Separate Behavior from Micro-architecture

• System Behavior
  – Functional specification of the system
  – No notion of hardware or software!
• Implementation Architecture
  – Hardware and Software
  – Optimized Computer

(Figure: the system behavior of an MPEG2 decoder — front end, transport decode, video decode, audio decode, synch control, rate buffers, video frame buffer, user/system control, audio output — mapped onto an implementation architecture built around a processor bus with an MPEG block, DSP, RAM, processor, peripheral control, external I/O processor, sensor, and memory.)

IP-Based Design of the Implementation

(Figure: the same bus-based architecture annotated with the design questions raised during IP-based implementation — Which bus: PI? AMBA? A dedicated bus for the DSP? Which DSP processor: C50? Can the DSP be done on the microcontroller? Which microcontroller: ARM? HC11? Can I buy an MPEG2 processor? Do I need a dedicated audio decoder? Can decode be done on the microcontroller? How fast will my software run? How much can I fit on my microcontroller?)

So far, we have introduced several models and how they can be used to describe a system's functionality, data, control, and structure. An architecture is intended to supplement these descriptive models by specifying how the system will actually be implemented. Architectures can range from simple controllers to massively parallel processors. Depending on the target architecture, mapping the model to the architecture implementation may require a different design process. A system behavioral description should be unbiased and implementation independent, thus allowing maximum flexibility in system architecture exploration and optimization.

This shows the issues we face during the implementation of a typical DSP application. The architecture is based on a common bus interconnecting several IPs. An IP can be regarded as a pre-verified hardware library block designed with the intention of reuse across several systems. Through the reuse of IP, system design productivity can be greatly increased.

AMBA-Based SoC Architecture

[Diagram: a typical AMBA-based SoC platform; see the notes below.]

Languages

z Hardware Description Languages
– VHDL / Verilog / SystemVerilog
z Software Programming Languages
– C / C++ / Java
z Architecture Description Languages
– EXPRESSION / MIMOLA / LISA
z System Specification Languages
– SystemC / SLDL / SDL / Esterel
z Verification Languages
– PSL (Sugar, OVL) / OpenVERA

27 28

This slide shows a typical AMBA-based SoC architecture, a very common platform for SoC implementation. The platform consists of two buses, the AHB (Advanced High-performance Bus) and the APB (Advanced Peripheral Bus), interconnected by a bridge which acts as a slave on the AHB side and a master on the APB side. The microprocessor and other high-performance cores are attached to the AHB, while slower peripheral devices are attached as slaves on the APB.

The conceptual models mentioned before are used to understand, organize, and define a system's functionality. To capture these models in a concrete form, system specification languages are required. Especially in the earliest design phase, it is preferable that the system's conceptual view be described in an executable specification language capable of capturing the system functionality in a machine-readable and simulatable form. Such an approach has several advantages. First, simulating the specification allows the designer to verify the correctness of the system's intended functionality. Second, the executable specification can serve as an input to synthesis tools. Third, the specification can also serve as comprehensive documentation.

This slide lists several popular specification languages, each of which specifies the system from a different aspect. Depending on the application, one or more languages may be used in the same design. Among them, hardware description languages and software programming languages should already have been addressed in other courses, and system specification languages were introduced earlier; here, architecture description and verification languages will be discussed.
Hardware Description Languages (HDL)

z VHDL (IEEE 1076)
z Verilog 1.0 (IEEE 1364)
z Verilog 2.0 (IEEE 1364)
z SystemVerilog 3.0 (to be sent for IEEE review)
z SystemVerilog 3.1 (to be sent for IEEE review)

SystemVerilog 3.0

z Built-in C types
z Synthesis improvements (avoid simulation and synthesis mismatches)
z Enhanced design verification
– procedural assertions
z Improved modeling
– communication interfaces

29 30

Here are some of the current state-of-the-art hardware description languages. Verilog is used on a large scale in Taiwan, VHDL is the preferred language in Europe, and in the USA both are used on an equal basis. VHDL is more strongly typed than Verilog. SystemVerilog is still under development and standardization.

SystemVerilog 3.0 has several enhancements over Verilog 2.0. Additional built-in C types allow designers more flexibility in writing HDL code. Mismatches between simulation and synthesis result in simulatable but unsynthesizable designs; SystemVerilog tries to eliminate these mismatches. Assertions, which help design verification by embedding designer intent into the system at the code level, have been added, along with several enhanced communication interfaces.

SystemVerilog 3.1

z Testbench automation
– Data structures
– Classes
– Inter-process communication
– Randomization
z Temporal assertions
– Monitors
– Coverage analysis

SystemVerilog Environment for Ethernet MAC

[Diagram: the SystemVerilog-based verification environment for an Ethernet MAC; see the notes below.]

31 32

SystemVerilog 3.1 includes further testing and verification mechanisms, such as testbench data structures, classes, inter-process communication, randomization, and temporal assertions.

This diagram shows the SystemVerilog environment that was used to design an Ethernet MAC system. The assertions (blue) and the testbenches (green) were made possible by SystemVerilog.

Architecture Description Language

z Objectives for Embedded SOC Design
– Support automated SW toolkit generation
• exploration quality SW tools (performance estimator, profiler, …)
• production quality SW tools (cycle-accurate simulator, memory-aware compiler, …)
– Specify a variety of architecture classes (VLIWs, DSP, RISC, ASIPs, …)
– Specify novel memory organizations
– Specify pipelining and resource constraints

Architecture Description in SOC Codesign Flow

[Diagram: the ADL-based codesign flow — the specification, estimators, and IP library feed HW/SW partitioning; the architecture description file drives the compiler, simulator, synthesis (VHDL, Verilog), formal verification, and cosimulation for a target with a processor core, on-chip and off-chip memory, and a synthesized HW interface. Annotated benefits: rapid design space exploration, quality tool-kit generation, design reuse.]

33 34

Architecture description languages are intended to describe the various architectural alternatives available for system design. An ADL may specify different architecture classes such as DSP or RISC, different memory organizations, and different pipelining and resource constraints. Through these architecture specifications, it can also aid the estimation and generation of the software.

This figure shows how architectures can be described using an ADL (architecture description language) and then used throughout the SoC hardware-software codesign process. Here the IP can be described using the ADL, which supports the information needed for performance estimation, SW compilation, and HW/SW cosimulation.

Property Specification Language (PSL)

z Accellera: a non-profit organization for standardization of design & verification languages
z PSL = IBM Sugar + Verplex OVL
z System Properties
– Temporal Logic for Formal Verification
z Design Assertions
– Procedural (like SystemVerilog assertions)
– Declarative (like OVL assertion monitors)
z For:
– Simulation-based Verification
– Static Formal Verification
– Dynamic Formal Verification

System Partitioning

35 36

The Property Specification Language is a language standard that aids the verification of a system. Based on this standard language, formal verification techniques can be applied to provide another verification approach.

After introducing several key issues in developing a functional system specification, including models, architectures, and languages, we now turn our focus to the system design itself. One of the most challenging and important tasks in system design is finding a way to partition the system's functionality among various system components so that all of the design constraints are satisfied. We will address the general partitioning issues, of which HW/SW partitioning can be regarded as a special case.

System Partitioning (Functional Partitioning)

z System partitioning in the context of hardware/software codesign is also referred to as functional partitioning
z Partitioning functional objects among system components is done as follows
– The system's functionality is described as a collection of indivisible functional objects
– Each system component's functionality is implemented in either hardware or software
z Constraints
– Cost, performance, size, power
z An important advantage of functional partitioning is that it allows hardware/software solutions
z This is a multivariate optimization problem that, when automated, is an NP-hard problem

Issues in Partitioning

High Level Abstraction
Decomposition of functional objects
• Granularity
• Metrics and estimations
• Partitioning algorithms
• Objective and closeness functions
Component allocation
Output

For the system or functional partitioning approach, we first decompose the system's functionality into indivisible pieces called functional objects. These objects are partitioned among system components, after which we implement each component's functionality either as hardware or as software. It is analogous to structural partitioning, in which the structure of a system is refined into lower-level hardware components; in structural partitioning, however, functionality cannot be moved from hardware to software. [Gajski94]

The material is presented in the sequence in which we encounter these issues in a typical partitioning process. First, the specification abstraction level defines the input in terms of the abstraction level of the conceptual model; various partitioning techniques consider several abstraction levels, including task-level dataflow graphs, tasks, arithmetic-level dataflow graphs, FSMs with datapaths, register transfers, and structure. We are then concerned with the granularity of the functional objects into which the input is decomposed. Metrics, partitioning algorithms, and objective and closeness functions are used to determine which partition to implement. The system component allocation process chooses components, from among those available, to implement the partition. Finally, we have the output, which is a fully implemented system. [Gajski94]
Granularity Issues in Partitioning

z The granularity of the decomposition is a measure of the size of the specification in each object
z The specification is first decomposed into functional objects, which are then partitioned among system components
– Coarse granularity means that each object contains a large amount of the specification
– Fine granularity means that each object contains only a small amount of the specification
• Many more objects
• More possible partitions
– Better optimizations can be achieved

System Component Allocation

z The process of choosing system component types from among those allowed, and selecting a number of each to use in a given design
z The set of selected components is called an allocation
– Various allocations can be used to implement a specification, each differing primarily in monetary cost and performance
– Allocation is typically done manually or in conjunction with a partitioning algorithm
z A partitioning technique must designate the types of system components to which functional objects can be mapped
– ASICs, memories, etc.

For functional partitioning, the system function is first decomposed into several functional objects; the granularity is a measure of the size of each object. Coarse granularity means that each object contains a large amount of the specification, while fine granularity means each object contains a small amount. Although finer granularity may potentially lead to better results, it suffers from several drawbacks. First, since fine granularity produces more objects, partitioning must either use more computation time or yield poorer results. Second, manual interaction is harder to support, and the partitioning output is less comprehensible to humans. Finally, estimation is more difficult. Therefore, a reasonable partitioning granularity should be chosen in practice. Note that it is the system functionality that is being partitioned here, in order to achieve a better allocation and assignment to hardware or software; that is, a number of objects in a partition defined here may later be assigned to the same hardware or software. [Gajski94]

An integral part of the partitioning problem is allocating portions of the system to actual components for their implementation. Obviously, the system "tasks" must be allocated to components that are capable of performing them. The set of selected components is called an allocation. Various allocations may be able to implement a given specification, each differing in cost and performance. Allocation is typically either done manually or performed in conjunction with a partitioning algorithm. [Gajski94]

Metrics and Estimations Issues

z A technique must define the attributes of a partition that determine its quality
– Such attributes are called metrics
• Examples include monetary cost, execution time, communication bit-rates, power consumption, area, pins, testability, reliability, program size, data size, and memory size
• Closeness metrics are used to predict the benefit of grouping any two objects
z Need to compute a metric's value
– Because all metrics are defined in terms of the structure (or software) that implements the functional objects, it is difficult to compute costs as no such implementation exists during partitioning

Computation of Metrics

z Two approaches to computing metrics
– Creating a detailed implementation
• Produces accurate metric values
• Impractical as it requires too much time
– Creating a rough implementation
• Includes the major register transfer components of a design
• Skips details such as precise routing or optimized logic, which require much design time
• Determining metric values from a rough implementation is called estimation

A technique must define the attributes of a partition that determine whether the partition is good or bad. Such attributes are called metrics; examples include area, execution time, power consumption, pins, and reliability. Metrics must be defined to determine a partition's relative cost versus other potential partitionings. Some metrics, such as the execution time of a given task on a specific processor, are impossible to measure precisely until a final implementation is made; therefore, accurate and fast cost estimation is mandatory for a good partitioning algorithm. In addition to metrics that evaluate a complete partition, there is another type of metric called a closeness metric, used to predict the benefit of grouping two functional objects. For example, two objects that share many of the same variables would have a high closeness. [Gajski94]

There are two basic approaches to computing metrics. The first is to create a detailed implementation and directly measure the metrics of interest; the second is to estimate the given metric from the abstract system model in use at the time. Although the first approach provides very accurate metrics, it is nearly impractical because it takes too much time to finish an implementation, not to mention that there are usually many candidate partitions. Therefore, the second approach is usually preferred, relying on algorithms that estimate the metrics of partitions in a very short time. [Gajski94]

Estimation of Partitioning Metrics

z Deterministic estimation techniques
– Can be used only with a fully specified model with all data dependencies removed and all component costs known
– Result in very good partitions
z Statistical estimation techniques
– Used when the model is not fully specified
– Based on the analysis of similar systems and certain design parameters
z Profiling techniques
– Examine control flow and data flow within an architecture to determine computationally expensive parts which are better realized in hardware

Objective and Closeness Functions

z Multiple metrics, such as cost, power, and performance, are weighed against one another
– An expression combining multiple metric values into a single value that defines the quality of a partition is called an Objective Function
– The value returned by such a function is called cost
– Because many metrics may be of varying importance, a weighted-sum objective function is used
• e.g., Objfct = k1 * area + k2 * delay + k3 * power (equation 1)
– Because constraints always exist on each design, they must be taken into account
• e.g., Objfct = k1 * F(area, area_constr) + k2 * F(delay, delay_constr) + k3 * F(power, power_constr)

Metrics must be used to guide the partitioning process. The type of metrics used depends a great deal on the level of description of the system, and how a metric is estimated depends largely on the metric itself. The estimation techniques can be classified into three categories. Deterministic techniques provide good estimates, but a fully specified model must be built in advance; while this produces accurate metric values, it is impractical because it requires too much time — a fully specified model may take days or months to create. Statistical estimation, based on the analysis of similar systems, can be used when the model is not fully specified. [DeMicheli93]

Since a partition is not evaluated on a single metric, multiple metrics may be considered simultaneously to determine its overall evaluation. Objective functions with varying weights for area, timing, and power may be used to reflect their relative importance in each system being designed. For example, equation 1 above is a weighted-sum function in which three metric values — area, delay, and power — are weighted by the constants k1, k2, and k3 respectively, and then summed; giving k1 a larger value than k2 and k3 makes area the most important metric. Since most design decisions are driven by constraints, simple functions such as equation 1 are rarely used. Constraints must be incorporated into the function so that partitions that meet constraints are considered better than those that don't. For example, we can extend the objective function as shown above, where F is a function indicating how close the metric's estimate is to the given constraint. A common form of F returns the amount by which the metric's estimate violates the constraint, returning zero when there is no violation. This form of F causes the objective function to return zero when a partition meets all constraints, making the goal of partitioning to obtain a cost of zero.

A closeness function, similar to the objective function, consists of an expression over several metrics. However, a closeness function is used to guide grouping during the partitioning process: while an objective function combines metrics to evaluate a complete partition, a closeness function combines closeness metrics to indicate the desirability of grouping objects before a complete partition exists. [Gajski94]
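The constraint-penalty form of the objective function described above is straightforward to make concrete. A minimal sketch, assuming a simple F that returns the amount of constraint violation; the weights and constraint values are hypothetical:

```python
# Sketch of the constraint-penalty objective function described above.
# F(metric, constr) returns the amount by which the estimate violates the
# constraint (zero when it is met), so a partition that satisfies every
# constraint gets a cost of zero. Weights and constraints are hypothetical.

def violation(estimate, constraint):
    """F: amount by which the metric's estimate exceeds its constraint."""
    return max(0.0, estimate - constraint)

def objfct(area, delay, power,
           area_constr, delay_constr, power_constr,
           k1=1.0, k2=1.0, k3=1.0):
    return (k1 * violation(area, area_constr) +
            k2 * violation(delay, delay_constr) +
            k3 * violation(power, power_constr))

# A partition meeting every constraint costs 0; each violation adds a
# weighted penalty, so a lower cost always means "closer to feasible".
print(objfct(90, 10, 5, area_constr=100, delay_constr=12, power_constr=6))   # 0.0
print(objfct(110, 14, 5, area_constr=100, delay_constr=12, power_constr=6))  # 12.0
```

This is the form that makes "obtain a cost of zero" a meaningful goal: the function cannot distinguish between two feasible partitions, which is why weighted raw metrics are often mixed back in for tie-breaking.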

Partitioning Algorithm Classes

z Constructive algorithms
– Group objects into a complete partition
– Use closeness metrics to group objects, hoping for a good partition
– Spend computation time constructing a small number of partitions
z Iterative algorithms
– Modify a complete partition in the hope that such modifications will improve the partition
– Use an objective function to evaluate each partition
– Yield more accurate evaluations than closeness functions used by constructive algorithms
z In practice, a combination of constructive and iterative algorithms is often employed

Iterative Partitioning Algorithms

z The computation time in an iterative algorithm is spent evaluating large numbers of partitions
z Iterative algorithms differ from one another primarily in the ways in which they modify the partition and in which they accept or reject bad modifications
z The goal is to find the global minimum while performing as little computation as possible
[Cost curve over a sequence of moves: A and B are local minima; C is the global minimum]

A partitioning algorithm can be classified into one of two general categories. A constructive algorithm starts from the decomposed functional objects and builds the partition by organizing or grouping these objects, usually guided by the defined closeness function; its computation time is spent constructing a small number of partitions. An iterative algorithm instead iteratively modifies an initial partition in the hope of achieving a better result; it uses an objective function to evaluate each partition, which yields more accurate evaluations than the closeness functions used by constructive algorithms. In practice, a combination of constructive and iterative algorithms is often employed. The basic partitioning algorithms are described later.

The concept of a local minimum is demonstrated in the graph, which shows a sequence of moves and the cost after each move. A move is a partition modification that remaps an object from one group to another. The leftmost point represents the cost of an initial partition. After several moves, the cost decreases to A, but the next move increases the cost; A is referred to as a local minimum, and B is also a local minimum. In the graph, C represents a solution that would consume fewer resources (in HW and SW) than A or B, yet it might be very difficult to find, and solution A may be "good enough." An algorithm that accepts only moves that decrease cost is called a greedy algorithm; such algorithms cannot escape local minima. An algorithm that can escape a local minimum is often called a hill-climbing algorithm. Multivariate optimization is a much-studied problem in computer science; the interested reader should investigate this area to understand the problems inherent in optimal partitioning. [Gajski94]

Basic Partitioning Algorithms

z Constructive Algorithm
– Random mapping
• Only used for the creation of the initial partition
– Clustering and multi-stage clustering
– Ratio cut
z Iterative Algorithm
– Group migration
– Simulated annealing
– Genetic evolution
– Integer linear programming

Hierarchical Clustering

z A constructive algorithm based on closeness metrics to group objects
z Fundamental steps:
– Group closest objects
– Recompute closenesses
– Repeat until termination condition met
z Cluster tree maintains history of merges
– Cutline across the tree defines a partition
[Figure: cluster-tree construction steps (a)–(d)]

47 48

A partitioning algorithm maps each functional object to exactly one group, where each group represents a system component. Ideally, the partition produced by the algorithm is the one that yields the minimal cost, as computed by the selected objective function. Several common partitioning algorithms are listed here. The most trivial is random mapping, which can be used to generate an initial partition in which each object is randomly assigned to one of the given components; its computational complexity is O(n), where n is the number of objects. The remaining algorithms are discussed in the subsequent slides.

Hierarchical clustering belongs to the class of constructive algorithms. It starts with an initial set of functional objects, which can be created using the random mapping algorithm, and iteratively groups the closest objects together, as judged by the closeness function. Closeness metrics are intended to yield a partition with good global metric values. The approach groups the closest objects, recomputes closenesses after the grouping, and repeats until some termination condition is met; the complexity is dominated by the computation of closeness between all object pairs, which is O(n^2). After each grouping iteration, the closeness values are recomputed so the next stage of grouping can start. For example, Figure (a) shows four objects — o1, o2, o3, and o4 — and their closeness values. The two closest objects are o1 and o2, with a closeness value of 30. In Figure (b), we merge o1 and o2, and approximate the closeness values between the new object and o3 and o4 as the average of the previous closeness values. In Figure (c), we again merge the closest objects and compute a new closeness value. Assuming the algorithm terminates when no closeness exceeds a threshold of 15, the final partition is o1, o2, o3 in one group and o4 in another group.
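The merge-and-average procedure walked through above can be sketched as follows. Only the closeness of 30 between o1 and o2 is given in the notes; the remaining values are hypothetical, chosen so the run reproduces the described outcome (merge o1/o2, then o3, then stop at the threshold of 15):

```python
# Sketch of hierarchical clustering with averaged closeness, as described
# above: repeatedly merge the closest pair of clusters, approximate the new
# cluster's closeness to each remaining cluster as the average of its parts'
# closenesses, and stop when no closeness exceeds the threshold.

def hierarchical_cluster(objects, closeness, threshold):
    """objects: list of names; closeness: {(a, b): value} over object pairs."""
    clusters = [(o,) for o in objects]
    close = {frozenset([(a,), (b,)]): v for (a, b), v in closeness.items()}
    while len(clusters) > 1:
        best = max(close, key=close.get)
        if close[best] <= threshold:          # termination condition
            break
        a, b = tuple(best)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c != a and c != b]
        for c in clusters:                    # recompute closenesses
            vals = [close.pop(frozenset([x, c])) for x in (a, b)
                    if frozenset([x, c]) in close]
            if vals:
                close[frozenset([merged, c])] = sum(vals) / len(vals)
        del close[best]
        clusters.append(merged)
    return sorted(clusters)

# Values reproducing the o1..o4 walk-through (only the 30 is from the notes):
closeness = {('o1', 'o2'): 30, ('o1', 'o3'): 12, ('o2', 'o3'): 20,
             ('o1', 'o4'): 5, ('o2', 'o4'): 5, ('o3', 'o4'): 12}
print(hierarchical_cluster(['o1', 'o2', 'o3', 'o4'], closeness, threshold=15))
# [('o1', 'o2', 'o3'), ('o4',)]
```

Averaging is only one possible approximation of the merged cluster's closeness; min, max, or a size-weighted mean would give the same algorithm with a different bias.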

Ratio Cut

z A constructive algorithm that groups objects until a termination condition has been met.
z A new metric, ratio, is defined as

ratio = cut(P) / (size(p1) × size(p2))

– Cut(P): sum of the weights of the edges that cross between p1 and p2.
– Size(pi): size of pi.
z The ratio metric balances the competing goals of grouping objects to reduce the cutsize without grouping distant objects.
z Based on this new metric, the partitioning algorithm tries to group objects to reduce the cutsizes without grouping objects that are not close.

Greedy Partitioning

z Two-way partitioning algorithm between the groups of HW and SW.
z Suffers from the local minimum problem.

Repeat
  P_orig = P
  for i in 1 to n loop
    if Objfct(Move(P,o)) < Objfct(P) then
      P = Move(P,o)
    end if
  end loop
Until P = P_orig
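The ratio metric above can be computed directly from an edge-weight table; a small sketch with hypothetical edge weights and object sizes:

```python
# Sketch of the ratio metric defined above: the cut between two groups
# divided by the product of their sizes. The edge weights and object sizes
# in the example are hypothetical.

def ratio(p1, p2, edges, size):
    """edges: {(u, v): weight}; size: {object: size}."""
    cut = sum(w for (u, v), w in edges.items()
              if (u in p1 and v in p2) or (u in p2 and v in p1))
    return cut / (sum(size[o] for o in p1) * sum(size[o] for o in p2))

edges = {('a', 'b'): 5, ('a', 'c'): 2, ('b', 'c'): 1}
size = {'a': 1, 'b': 1, 'c': 2}
print(ratio({'a', 'b'}, {'c'}, edges, size))  # cut = 3, sizes 2 and 2 -> 0.75
```

The product of sizes in the denominator is what discourages merging a large group with a tiny one just because the tiny object has few edges: the metric rewards cuts between groups of balanced size.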

49 50

Ratio cut can, in fact, be regarded as a form of multi-stage clustering whose metric takes the size of the functional objects into account. One simple closeness function is cut-size, the number of edges between a pair of objects. If the clustering algorithm is built on cut-size alone, then — because a tiny object often has few edges — a large object may always be grouped with a tiny object even though they are not really close. To account for object sizes, the ratio is used as the closeness function instead. The ratio-cut algorithm groups objects until a termination condition is met, meaning that no objects are considered close enough to be merged; this termination condition relies on the new partition metric. A partition can be evaluated by the cutsizes of its groups, since a small cutsize indicates that close objects have been grouped, as desired. However, if cutsize were the only metric, the best cutsize would be reached by merging all objects into one group, even if the objects are not close. To avoid ending up with a single group, we could impose a constraint on the number of objects or the size of a group, but such a constraint would prevent us from finding good partitions with unbalanced group sizes. The goal of ratio-cut partitioning is therefore to group objects so as to reduce the cutsizes without grouping objects that are not close and without constraining group size. This is achieved by replacing the cutsize metric with the ratio metric, whose denominator encourages maintaining multiple groups with balanced sizes.

This slide also shows an iterative algorithm called two-way greedy partitioning. The algorithm starts from an initial partition consisting of two sets and moves objects, one at a time, from one set to the other, checking whether the move lowers the partitioning cost; if the cost would increase, the object is not moved. The algorithm begins with an all-hardware partition, thus guaranteeing that a performance-satisfying partition is found if one exists (certain behaviors considered unconstrainable are initially placed in software); a performance-satisfying partition is one in which all performance constraints are satisfied. Moving an object requires not only that the cost improve, but also that all performance constraints (and maximum interfacing constraints between groups) still be satisfied. Once a behavior is moved, the algorithm tries to move closely related objects first. The algorithm is quite simple, and while greedy algorithms are fast, their chief drawback is that they cannot escape a local minimum.
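The greedy loop in the pseudocode above can be fleshed out as a runnable sketch; the objective and the per-object area/time costs below are hypothetical:

```python
# Runnable sketch of the greedy two-way loop from the slide: keep sweeping
# over the objects, accepting any single move that lowers the objective,
# until a full pass makes no improvement. It starts all-hardware, as the
# notes describe. The per-object area/time costs are hypothetical.

def greedy_partition(objects, cost):
    """cost(hw, sw) -> float. Returns (hw, sw) sets at a local minimum."""
    hw, sw = set(objects), set()
    improved = True
    while improved:                        # Repeat ... Until P = P_orig
        improved = False
        for o in objects:
            src, dst = (hw, sw) if o in hw else (sw, hw)
            before = cost(hw, sw)
            src.remove(o); dst.add(o)      # tentative move
            if cost(hw, sw) < before:
                improved = True            # accept the move
            else:
                dst.remove(o); src.add(o)  # reject: undo it
    return hw, sw

# Hypothetical objective: HW area for hardware objects plus execution time
# for software objects, weighted equally.
area = {'a': 4, 'b': 1, 'c': 3}
time = {'a': 1, 'b': 5, 'c': 2}
cost = lambda hw, sw: sum(area[o] for o in hw) + sum(time[o] for o in sw)
hw, sw = greedy_partition(['a', 'b', 'c'], cost)
print(sorted(hw), sorted(sw))  # ['b'] ['a', 'c']
```

Note that the loop stops at the first partition no single move can improve — exactly the local-minimum behavior the notes describe.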

Group Migration

z Another iterative-improvement algorithm, extended from the two-way partitioning algorithm that suffers from the local minimum problem.
z The movement of objects between groups depends on whether it produces the greatest decrease or the smallest increase in cost.
– To prevent an infinite loop in the algorithm, each object can only be moved once.

Simulated Annealing

z Iterative algorithm modeled after the physical annealing process to avoid the local minimum problem.
z Overview
– Starts with initial partition and temperature
– Slowly decreases temperature
– For each temperature, generates random moves
– Accepts any move that improves cost
– Accepts some bad moves, less likely at low temperatures
z Results and complexity depend on the temperature decrease rate


The group migration algorithm improves on simple two-way partitioning, which suffers from the local minimum problem, by adding a control strategy. In two-way partitioning, an object is moved from one group to the other only if the move decreases the cost; this cannot detect a case in which no single move decreases the cost but a sequence of two or more moves does. To solve this, the move condition is relaxed: in each iteration, the move that produces either the greatest decrease or the smallest increase in cost is made. To prevent an infinite loop, each object may be moved only once. After all objects have been moved once, we select the lowest-cost partition encountered, and the entire algorithm is iterated using that partition as the new initial partition until we no longer encounter a lower-cost partition; the number of iterations is usually less than five. The algorithm is easily extended to multiway partitioning: instead of tentatively moving every object to its opposite group, we tentatively move every object to every other group. In an alternative extension, we first create two groups using two-way partitioning and then repeatedly partition each group into two, again using two-way partitioning, until we obtain the desired number of groups.

The group migration algorithm escapes local minima by accepting cost-increasing moves if they are part of a sequence of moves that leads to a lower-cost partition; however, its complexity is limited by moving each object only once in the sequence. The simulated annealing algorithm also accepts cost-increasing moves, but in contrast to group migration it may move each object more than once, limiting complexity by decreasing over time the tolerance for accepting cost-increasing moves. Simulated annealing is intended to model the annealing process in physics, where a material is melted and its minimal-energy state is achieved by lowering the temperature slowly enough that equilibrium is reached at each temperature. The algorithm starts with an initial partition and an initial simulated temperature, and the temperature is then slowly decreased. For each temperature, random moves are generated; the algorithm accepts any move that improves the cost, and otherwise may still accept the move, though such acceptance becomes less likely at lower temperatures.
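The acceptance rule and cooling schedule just described can be sketched as follows; the objective, move generator, and schedule constants are hypothetical stand-ins:

```python
# Sketch of simulated annealing for partitioning as described above: random
# moves, the standard acceptance probability min(1, e^(-Δcost/temp)), and a
# geometric cooling schedule. The objective and move generator here are
# hypothetical stand-ins.

import math
import random

def anneal(objects, cost, temp=10.0, cooling=0.9, moves_per_temp=50,
           t_min=0.01, seed=0):
    rng = random.Random(seed)
    part = {o: rng.choice(('HW', 'SW')) for o in objects}  # initial partition
    cur = cost(part)
    best, best_cost = dict(part), cur
    while temp > t_min:                        # "while not Frozen"
        for _ in range(moves_per_temp):        # "while not Equilibrium"
            o = rng.choice(objects)            # random move: flip one object
            part[o] = 'SW' if part[o] == 'HW' else 'HW'
            delta = cost(part) - cur
            # Accept(Δcost, temp) = min(1, e^(-Δcost/temp)): improving moves
            # are always kept; worsening ones survive with a probability
            # that shrinks as the temperature drops.
            if delta <= 0 or math.exp(-delta / temp) > rng.random():
                cur += delta
                if cur < best_cost:
                    best, best_cost = dict(part), cur
            else:
                part[o] = 'SW' if part[o] == 'HW' else 'HW'  # reject: undo
        temp *= cooling                        # DecreaseTemp
    return best, best_cost

# Hypothetical objective: each object has a preferred side; cost counts how
# many objects sit on the wrong side.
prefer = {'a': 'HW', 'b': 'SW', 'c': 'HW', 'd': 'SW'}
cost = lambda p: sum(p[o] != prefer[o] for o in p)
best, best_cost = anneal(list(prefer), cost)
print(best_cost)  # best cost found (0 when the preferred mapping is recovered)
```

At high temperature the loop behaves almost like a random walk; as the temperature falls it degenerates into the greedy algorithm, which is why the cooling rate controls both run time and result quality.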

Simulated Annealing Algorithm

temp = initial temperature
cost = Objfct(P)
while not Frozen loop
  while not Equilibrium loop
    P_tentative = Move(P)
    cost_tentative = Objfct(P_tentative)
    Δcost = cost_tentative - cost
    if Accept(Δcost, temp) > Random(0,1) then
      P = P_tentative
      cost = cost_tentative
    end if
  end loop
  temp = DecreaseTemp(temp)
end loop

where Accept(Δcost, temp) = min(1, e^(−Δcost/temp))

Genetic Evolution

z Genetic algorithms treat a set of partitions as a generation, and create a new generation from a current one by imitating three evolution methods found in nature.
z Three evolution methods
– Selection: randomly select a partition
– Crossover: randomly select two strong partitions, replicate a trait
– Mutation: randomly select a partition, randomly modify it (move an object)
z Produce good results but suffer from long run times.

while not Terminate loop
  /* Evolve generation */
  G = Select(G, num_sel) ∪ Cross(G, num_cross) ∪ Mutate(G, num_mutate)
  if Objfct(BestPart(G)) < Objfct(P_best) then
    P_best = BestPart(G)
  end if
end loop
end loop

53 54

This slide shows the algorithm of the simulated-annealing partitioning method. The key part of this algorithm is the acceptance procedure, which determines whether to accept a move based on the cost improvement and the current temperature. The procedure returns a value, and if that value is greater than a random number between 0 and 1, the move is accepted. As we can see, if we lower the temperature (i.e., decrease the value of temp), the value the procedure returns for an uphill move becomes smaller, and the probability of acceptance becomes small. The algorithm promises to achieve good results as the temperature is gradually decreased.
Theoretical studies have shown that the simulated-annealing algorithm can climb out of a local minimum and find the globally optimal solution if the process reaches an equilibrium state at each temperature, and if the temperature is lowered infinitely slowly. These conditions require an infinite number of iterations at an infinite number of temperatures, which is clearly impractical, so several heuristic approaches have been developed to control the simulated-annealing process. These heuristics define the equilibrium state and describe how to lower the temperature.
The group migration and simulated annealing algorithms improve a current partition by moving a number of objects, saving the best partition encountered, and iterating the process with this best partition. However, we need not restrict ourselves to saving only one partition from one iteration to the next.
Genetic evolution, modeled after the natural evolutionary process, can also be applied to partitioning problems. This algorithm regards a set of partitions as a generation. A genetic algorithm creates a new generation using three possible methods. One is selection, which randomly chooses a low-cost partition and copies it to the next generation. The second method is crossover, which randomly selects two strong partitions and replicates a trait of one in the other. The final method, mutation, randomly selects a partition and modifies it by moving some randomly selected objects.
Genetic algorithm complexity is heavily influenced by the form of the Terminate procedure. Like simulated annealing, genetic algorithms usually produce good results but suffer from long run times. Also, since genetic algorithms maintain multiple partitions, they require more memory.
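A minimal sketch of the annealing loop and its acceptance rule follows. The cooling schedule, move budget per temperature, and default parameter values are our own illustrative choices; the slide's Frozen/Equilibrium tests are approximated here by a temperature threshold and a fixed move count.

```python
import math
import random

def accept(delta_cost, temp):
    """Acceptance value: min(1, e^(-delta_cost/temp)). Downhill moves
    (delta_cost <= 0) are always accepted; uphill moves become less
    likely as the temperature drops."""
    return min(1.0, math.exp(-delta_cost / temp))

def anneal(partition, move, objfct, temp=100.0, cooling=0.9,
           frozen_temp=0.01, moves_per_temp=50):
    """Simulated-annealing loop following the slide's pseudocode."""
    cost = objfct(partition)
    while temp > frozen_temp:                  # "not Frozen"
        for _ in range(moves_per_temp):        # "not Equilibrium"
            p_new = move(partition)
            c_new = objfct(p_new)
            if accept(c_new - cost, temp) > random.random():
                partition, cost = p_new, c_new
        temp *= cooling                        # DecreaseTemp
    return partition, cost

# Toy usage: minimize |x| over integers with unit moves.
random.seed(0)
best, best_cost = anneal(10, lambda x: x + random.choice([-1, 1]), abs)
```

At high temperature the search wanders freely; once temp is small, only cost-reducing moves are effectively accepted.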

Estimation

• Estimates allow
  – Evaluation of design quality
  – Design space exploration
• Design model
  – Represents degree of design detail computed
  – Simple vs. complex models
• Issues for estimation
  – Accuracy
  – Speed
  – Fidelity

Accuracy vs. Speed

• Accuracy: difference between estimated and actual value

    A = 1 − |E(D) − M(D)| / M(D)

    E(D), M(D): estimated & measured value

• Speed: computation time spent to obtain estimate

[Figure: estimation error vs. computation time, ranging from a simple model to the actual design.]

• Simplified estimation models yield a fast estimator but result in greater estimation error and less accuracy.
55 56

In the previous slides, we described some of the basic issues and techniques for partitioning a system's functionality among various system components, such that constraints on various design metrics, such as performance and area, are satisfied. In order to determine whether these constraints have actually been satisfied, however, we must be able to obtain metric estimates as rapidly as possible. Here, we will describe some of the techniques that can be used for such rapid estimation of both hardware and software quality metrics.
The goal of estimation in partitioning or system design is to evaluate the design quality of the current design or partition based on some given design metric. This evaluation can tell us whether the constraints of the design will be met, and it can help to explore the possible design space.
When we estimate a given design, we assume some architecture on which the design will be implemented. In other words, a design model should be built for the quality metric to be estimated. The model can be simple or complex, depending on our estimation goal. In general, we want the estimation to be as accurate as possible, but high accuracy may require long computation time. We do not want every estimation to take a long time, because that would reduce the design space we can explore. Another estimation issue is fidelity, which reflects the relative correctness of the estimation results.
The accuracy of an estimate is a measure of how close the estimate is to the actual value of the metric measured after design implementation. A perfect estimate has an accuracy equal to 1. The accuracy depends on the degree of detail in the model. Estimators based on simple models execute rapidly but may not provide good enough accuracy.
A design model may incorporate several aspects of the design. For example, a detailed area estimate would require determining the number and size of the memories, functional units, registers and multiplexers, in addition to the wiring area. These calculations would in turn involve performing tasks such as functional-unit allocation, variable lifetime analysis, functional-unit binding and floorplanning. Estimators based on such detailed design models require longer computation times, but also produce more accurate estimates, and therefore allow a better selection of design alternatives.
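The accuracy definition above is a one-line computation; the numbers below are purely illustrative.

```python
def accuracy(estimated, measured):
    """A = 1 - |E(D) - M(D)| / M(D); equals 1 for a perfect estimate."""
    return 1.0 - abs(estimated - measured) / measured

# e.g. an area estimate of 95,000 gates against a measured 100,000 gates
a = accuracy(95_000, 100_000)   # -> 0.95
```

Note that over-estimates by more than 100% of the measured value drive A below zero, which is why some formulations clamp the result at 0.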

Fidelity

• Estimates must predict quality metrics for different design alternatives
• Fidelity: % of correct predictions for pairs of design implementations
• The higher the fidelity of the estimation, the more likely that correct decisions will be made based on estimates.
• Definition of fidelity:

    F = 100 × (2 / (n(n−1))) × Σ(i=1..n) Σ(j=i+1..n) u(i,j)

[Figure: estimated vs. measured metric values for design points A, B and C:
    (A, B): E(A) > E(B), M(A) < M(B)
    (B, C): E(B) < E(C), M(B) > M(C)
    (A, C): E(A) < E(C), M(A) < M(C)
    Fidelity = 33 %]

Quality Metrics

• Performance Metrics
  – Clock cycle, control steps, execution time, communication rates
• Cost Metrics
  – Hardware: manufacturing cost (area), packaging cost (pin)
  – Software: program size, data memory size
• Other metrics
  – Power
  – Design for testability: Controllability and Observability
  – Design time
  – Time to market

57 58

If the estimation of a metric is used to check whether the design can meet the target constraint, then the accuracy of the estimation result becomes very important. The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations.
However, if the estimation is used to compare different implementations or different partitions in order to find the best one, fidelity becomes the more important issue. Fidelity reflects the correctness of the predictions: the higher the fidelity of the estimation, the more likely that correct design decisions will be made based on comparing the estimates of two implementations.
During the selection of one of several design implementations, predictions of design quality based on estimates with high fidelity will result, on average, in better designs. Fidelity depends upon the design model used to estimate the design parameter. In general, the more accurate the model, the higher the fidelity of the estimation.
The quality metrics we are most often interested in during system design are discussed here. The two essential kinds are performance metrics and cost metrics. The performance of the system can be regarded from many aspects, including execution time, communication rates, clock cycle, or control steps.
Performance metrics can be divided into computation and communication metrics. Computation metrics measure the time required to perform the computations within a behavior; these metrics can be classified according to the type of time units used in computing the metric. Communication metrics are related to the time spent by a behavior in interacting with the other behaviors in the system. Although execution time reflects the performance best, it takes more effort to estimate.
The cost of a system can be measured in silicon area or in program size, depending on the application of interest. In addition to these metrics, there are other metrics to consider, including power, time-to-market, and testability.
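The fidelity definition can be computed directly over all design pairs. Below, u(i,j) is taken as 1 when the estimate orders designs i and j the same way as the measurement; the metric values for A, B and C are invented so as to reproduce the slide's three pairwise relations.

```python
def fidelity(estimates, measured):
    """F = 100 * (2 / (n*(n-1))) * sum of u_ij over all design pairs,
    where u_ij = 1 if the estimated ordering of designs i, j matches
    the measured ordering, else 0."""
    names = list(estimates)
    n = len(names)
    correct = 0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = names[i], names[j]
            if (estimates[a] > estimates[b]) == (measured[a] > measured[b]):
                correct += 1
    return 100.0 * 2.0 / (n * (n - 1)) * correct

# Values chosen to match the example: only the (A, C) pair is predicted
# correctly, so F = 100 * (2/6) * 1 = 33.3%.
E = {"A": 3, "B": 2, "C": 4}      # estimated metric values (illustrative)
M = {"A": 2, "B": 3, "C": 2.5}    # measured metric values (illustrative)
f = fidelity(E, M)
```

A real implementation would also need a convention for ties in the estimated or measured values, which this sketch ignores.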

Scheduling

• A scheduling technique is applied to the behavior description in order to determine the number of control steps.
• It is quite expensive to obtain the estimate based on scheduling.
• Resource-constrained vs. time-constrained scheduling.

Control Steps Estimation

• Operations in the specification are assigned to control steps
• Number of control steps reflects:
  – Execution time of design
  – Complexity of control unit
• Techniques used to estimate the number of control steps in a behavior specified as straight-line code
  – Operator-use method
  – Scheduling
59 60

A scheduling technique may also be applied to the behavior in order to determine the number of control steps. Usually a scheduling algorithm can produce a very good estimate of the control steps; however, the computational complexity of the algorithm may be high. Scheduling techniques can roughly be divided into two classes. One is resource-constrained scheduling, which assumes that a constrained number of functional units is given in advance. Given the resource constraints, the operations in the behavior are scheduled with a view to minimizing the total number of control steps. For each type of operation in the behavior, the list-scheduling algorithm maintains a priority list from which operations are assigned to control steps. At any stage, the priority list for an operation type contains the set of operations that have all their predecessors already scheduled; thus, all predecessors of an operation must be executed before that operation itself is executed. The other class is time-constrained scheduling, which finds a schedule that can meet the timing requirement with the least amount of resources.
A control step corresponds to a single state of the control unit's state machine. Operations in the functional specification are assigned to these control steps during synthesis, and each operation may take a different number of control steps. Therefore, estimating the control steps amounts to finding the number of cycles required to execute the system. The number of steps reflects the execution time and the complexity of the control unit.
The number of control steps needed for a behavior's execution can be estimated in several ways. We first describe two techniques, the operator-use method and scheduling, for estimating the number of control steps in a behavior specified as straight-line code. We then present a technique for estimating the number of control steps for a behavior that contains control constructs such as branching and iteration.
Here, we introduce two simple ways to obtain the number of control steps in a behavior, focusing first on behaviors in which no branches occur.
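A stripped-down resource-constrained list scheduler can be sketched as follows. This sketch assumes single-cycle operations and an arbitrary ready-list order (a real list scheduler would order the ready list by a priority function such as mobility); the data-structure shapes are our own, and the dependence graph is assumed acyclic with at least one unit per operation type.

```python
def list_schedule(ops, deps, num_units):
    """Assign each operation to a control step. `ops` maps op name ->
    operation type; `deps` maps op -> list of predecessor ops;
    `num_units` maps type -> number of available functional units."""
    scheduled = {}
    step = 0
    while len(scheduled) < len(ops):
        step += 1
        used = {t: 0 for t in num_units}   # units consumed this step
        # Ready ops: all predecessors finished in an earlier step.
        ready = [o for o in ops if o not in scheduled and
                 all(p in scheduled and scheduled[p] < step
                     for p in deps.get(o, []))]
        for o in ready:
            t = ops[o]
            if used[t] < num_units[t]:     # respect resource constraint
                scheduled[o] = step
                used[t] += 1
    return scheduled

# Two multiplies feeding an add, with one multiplier and one adder:
sched = list_schedule(
    ops={"m1": "*", "m2": "*", "a1": "+"},
    deps={"a1": ["m1", "m2"]},
    num_units={"*": 1, "+": 1},
)
```

The two multiplications serialize on the single multiplier (steps 1 and 2), and the addition follows in step 3, so three control steps are estimated.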

Operator-use Method

• Easy to estimate the number of control steps given the resources of its implementation.
• Number of control steps for each node can be calculated:

    csteps(nⱼ) = max over tᵢ ∈ T of [ ⌈occur(tᵢ) / num(tᵢ)⌉ × clocks(tᵢ) ]   (1)

• The total number of control steps for behavior B is

    csteps(B) = Σ over nⱼ ∈ N of csteps(nⱼ)   (2)

Operator-use Method Example

• Differential-equation example: [figure not reproduced]

61 62

The first method is called the operator-use method. It first divides all statements in a behavior into a set of nodes, in such a way that all the statements in a node could be executed concurrently. The method then simply considers the number of operations required and the number of functional units available.
Let T represent the set of distinct types of operations in a behavior B. Let num(tᵢ) and clocks(tᵢ) represent the number and delay (in clock cycles) of functional units available to implement operations of type tᵢ. Then, if there are occur(tᵢ) occurrences of an operation, Equation (1) gives the number of steps needed to execute the operations of type tᵢ. The number of control steps needed for any node nⱼ, csteps(nⱼ), is equal to the maximum number of control steps needed to perform the operations of any type in the node; Equation (1) gives the details. Once the control steps for each node have been determined, the total number of control steps required by behavior B is determined as in Equation (2), where N is the set of nodes in B.
This slide shows an example of the operator-use method. The differential equation's behavior is first described as a set of statements. Next, we analyze the data dependences among these statements. Independent statements are then grouped together, and the operator-use method is applied to each block. The final estimation result is the sum of the control steps required for each block.
The operator-use method can be extended to incorporate pipelined functional units and to include memory accesses. It provides fairly rapid estimates of the number of control steps required by a behavior: if there are n operations in the behavior, the method has a computational complexity of O(n). However, there is potential for error, since the method operates at a statement-level granularity and ignores the dependencies between the operations within a statement. In conclusion, the operator-use method generates more accurate estimates if each statement in the specification is restricted to one operation.
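Equations (1) and (2) translate directly into code. The node contents and unit counts below are invented for illustration: one 2-cycle multiplier and one 1-cycle ALU.

```python
from math import ceil

def csteps_node(node_ops, num, clocks):
    """Equation (1): for each operation type t appearing occur(t) times
    in the node, it needs ceil(occur(t)/num(t)) * clocks(t) steps; the
    node as a whole needs the maximum over its types."""
    occur = {}
    for t in node_ops:
        occur[t] = occur.get(t, 0) + 1
    return max(ceil(occur[t] / num[t]) * clocks[t] for t in occur)

def csteps_behavior(nodes, num, clocks):
    """Equation (2): nodes execute one after another, so the behavior's
    total is the sum of the per-node step counts."""
    return sum(csteps_node(n, num, clocks) for n in nodes)

num = {"*": 1, "+": 1}        # one multiplier, one adder
clocks = {"*": 2, "+": 1}     # multiplier takes 2 cycles, adder 1
n1 = ["*", "*", "*", "+"]     # ceil(3/1)*2 = 6 vs ceil(1/1)*1 = 1 -> 6
n2 = ["+", "+"]               # ceil(2/1)*1 = 2
total = csteps_behavior([n1, n2], num, clocks)   # 6 + 2 = 8
```

With these assumed resources, the behavior is estimated at 8 control steps.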

Execution Time Estimation

• Average start-to-finish time of behavior
• Straight-line code behaviors

    execution(B) = csteps(B) × clk   (1)

• Behavior with branching
  – Estimate execution time for each basic block
  – Create control flow graph from basic blocks
  – Determine branching probabilities
  – Formulate equations for node frequencies
  – Solve set of equations

    execution(B) = Σ over bᵢ ∈ B of exectime(bᵢ) × freq(bᵢ)   (2)

Probability-based Flow Analysis

[Figure: (a) a VHDL behavior, (b) its basic blocks, (c) the corresponding control flow graph.]

63 64

The estimation of the execution time for straight-line code is straightforward: we first estimate the number of control steps, csteps(B), and then multiply it by the clock period. In the general case, however, a behavior may consist of sequential statements that contain branching and iteration constructs. Since branching may depend upon input data, we have to analyze the frequency of each statement.
We first determine the set of basic blocks for the behavior. Since each basic block consists of a set of sequential assignments, we can determine the execution time for each block by first determining the number of control steps it requires. To determine the execution time for the entire behavior, the execution time exectime(bᵢ) of each basic block bᵢ needs to be weighted by the execution frequency of the basic block. The execution frequency freq(bᵢ) of a basic block bᵢ is defined as the number of times that the basic block will execute, on average, during a single execution of the behavior to which it belongs. Once the execution frequencies have been determined, execution(B) is estimated as in Equation (2).
In order to estimate the execution frequency of each basic block, a control flow graph model first has to be built. Based on this model, we can determine the execution frequencies of the nodes using flow analysis based on the branching probabilities. A branch probability measures how often a branch is taken after evaluating the condition associated with a branching statement. The probabilities can be computed statically for the loop case, or obtained dynamically by simulating the behavior on several sets of sample data, recording how often the various branches were executed and, consequently, deriving the probabilities for the individual branches.
We illustrate the probability-based flow analysis method with the example VHDL behavior in Figure (a). The basic blocks for the VHDL statements are shown in Figure (b). The control flow graph in Figure (c) has a vertex for each basic block in the behavior; an edge exists between two vertices if control could be transferred between the corresponding basic blocks.
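The flow equations on the next slide form a linear system that can be solved directly or, as sketched here, by fixed-point iteration (the edge list reproduces that example, where v5 loops back to v2 with probability 0.9; the function name and iteration count are our own choices).

```python
def solve_frequencies(edges, iters=1000):
    """Iterate freq(v) = sum over predecessors u of prob(u->v) * freq(u),
    with freq(S) fixed at 1.0. `edges` is a list of (src, dst, prob)."""
    nodes = {v for e in edges for v in e[:2]}
    freq = {v: 0.0 for v in nodes}
    freq["S"] = 1.0
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        new["S"] = 1.0                      # the start node runs once
        for u, v, p in edges:
            new[v] += p * freq[u]
        freq = new
    return freq

# Control flow graph of the example behavior:
edges = [("S", "v1", 1.0), ("v1", "v2", 1.0), ("v5", "v2", 0.9),
         ("v2", "v3", 0.5), ("v2", "v4", 0.5),
         ("v3", "v5", 1.0), ("v4", "v5", 1.0), ("v5", "v6", 0.1)]
freq = solve_frequencies(edges)
```

Because the loop gain is 0.9 < 1, the iteration converges to the slide's values: freq(v2) = freq(v5) = 10, freq(v3) = freq(v4) = 5, freq(v6) = 1.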

Probability-based Flow Analysis

• Flow equations:

    freq(S)  = 1.0
    freq(v1) = 1.0 × freq(S)
    freq(v2) = 1.0 × freq(v1) + 0.9 × freq(v5)
    freq(v3) = 0.5 × freq(v2)
    freq(v4) = 0.5 × freq(v2)
    freq(v5) = 1.0 × freq(v3) + 1.0 × freq(v4)
    freq(v6) = 0.1 × freq(v5)

• Node execution frequencies:

    freq(v1) = 1.0     freq(v2) = 10.0
    freq(v3) = 5.0     freq(v4) = 5.0
    freq(v5) = 10.0    freq(v6) = 1.0

Communication Rate

• Communication between concurrent behaviors (or processes) is usually represented as messages sent over an abstract channel.
• A communication channel may be either explicitly specified in the description or created after system partitioning.
• The average rate of a channel C, average(C), is defined as the rate at which data is sent during the entire channel's lifetime:

    average(C) = total_bits(B, C) / (comptime(B) + commtime(B, C))

• The peak rate of a channel, peakrate(C), is defined as the rate at which data is sent in a single message transfer:

    peakrate(C) = bits(C) / protocol_delay(C)

65 66

This slide shows an example of probability-based flow analysis. The execution frequency of any node depends on the weighted execution frequencies of all its immediate predecessor nodes. We can derive a set of flow equations based on this relation; solving the flow equations gives the node execution frequencies.
In the above method, we obtained the execution time of the entire behavior by associating the execution time of the corresponding basic block with each node and performing flow analysis. Probability-based flow analysis can also be applied to determine other useful design characteristics, by associating different types of information with each node in the control flow graph. For example, if each node in the control flow graph represents the number of memory accesses in the corresponding basic block, flow analysis will yield the average number of memory accesses during execution of the entire behavior. Similarly, if we associate with each node the number of calls made to a certain procedure in the basic block, we can determine the number of calls made by the entire behavior to the procedure. Finally, by associating with each node the number of data transfers over a channel in the corresponding basic block, flow analysis will yield the total number of channel accesses by the entire behavior.
A communication channel may be either specified in the behavioral description or created when a variable accessed by a portion of the behavior is assigned to a different chip or block. Since the channel is part of the overall system, its effect has to be analyzed as well.
To evaluate the channel, several terms such as average rate and peak rate have been defined. The total number of bits sent over a channel C during the lifetime of behavior B is total_bits(B, C). Computation time, comptime(B), is defined as the time required by the behavior B to perform its internal computations. Communication time is defined as the time spent by the behavior in accessing data external to the behavior; the communication time required by a behavior B to transfer data over channel C is denoted commtime(B, C). bits(C) represents the number of bits received or sent over channel C in a single message, and protocol_delay(C) represents the delay associated with the transfer of a single message over channel C.
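The two channel-rate definitions are simple ratios; the numbers below are invented for illustration (4096 bits transferred over a behavior that computes for 800 time units and communicates for 224, using 32-bit messages with a 2-unit protocol delay).

```python
def average_rate(total_bits, comptime, commtime):
    """average(C) = total_bits(B, C) / (comptime(B) + commtime(B, C))"""
    return total_bits / (comptime + commtime)

def peak_rate(bits_per_message, protocol_delay):
    """peakrate(C) = bits(C) / protocol_delay(C)"""
    return bits_per_message / protocol_delay

avg = average_rate(4096, 800, 224)   # -> 4.0 bits per time unit
peak = peak_rate(32, 2)              # -> 16.0 bits per time unit
```

As expected, the peak rate exceeds the average rate, since the average also accounts for the time the behavior spends computing between transfers.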

Communication Rate Example

Area Estimation

• Two tasks:
  – Determining number and type of components required
  – Estimating component size for a specific technology (FSMD, gate arrays, etc.)
• Behavior implemented as an FSMD (finite state machine with datapath)
  – Datapath components: registers, functional units, multiplexers/buses
  – Control unit: state register, control logic, next-state logic
• Area can be assessed from the following aspects:
  – Datapath component estimation
  – Control unit estimation
  – Layout area for a custom implementation

67 68

Estimating the size of a design requires first determining the number and type of components required, and then the size of each component for a given technology. If we assume that our design is implemented as an FSMD, then the components of the FSMD include datapath components, such as functional units and registers, and the control unit. Therefore, the area estimation can be done separately for the datapath and the control unit.

Clique-partitioning

• Commonly used for determining datapath components
• Let G = (V, E) be a graph; V and E are the sets of vertices and edges
• A clique is a complete subgraph of G
• Clique-partitioning
  – divides the vertices into a minimal number of cliques
  – each vertex in exactly one clique
• One heuristic: maximum number of common neighbors
  – Two nodes with maximum number of common neighbors are merged
  – Edges to two nodes replaced by edges to merged node
  – Process repeated till no more nodes can be merged

69 70

To estimate the number of datapath components required, a clique-partitioning approach is often used. A graph G = (V, E) is first derived from the behavior, where each node in V represents an operation and each edge in E connects nodes that can be mapped to the same functional unit. A complete subgraph, or clique, of G represents one functional unit implementing all the operations in the subgraph. The number of functional units required is found by covering all the nodes with the minimal number of cliques. The accompanying slide shows an example of clique partitioning.
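The common-neighbor heuristic on the slide can be sketched as follows. This is one interpretation of the merge step (after merging, the supernode keeps edges only to vertices compatible with every member, i.e. the neighbor intersection); the data-structure layout is our own.

```python
def clique_partition(vertices, edges):
    """Greedy clique partitioning: repeatedly merge the adjacent pair
    with the most common neighbors until no edges remain."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    groups = {v: [v] for v in vertices}        # supernode -> members
    while True:
        pairs = [(len(adj[a] & adj[b]), a, b)
                 for a in adj for b in adj[a] if a < b]
        if not pairs:
            break
        _, a, b = max(pairs)                   # most common neighbors
        common = adj[a] & adj[b]
        groups[a].extend(groups.pop(b))        # absorb b into a
        for v in adj.pop(b):                   # detach b from the graph
            adj[v].discard(b)
        for v in adj[a] - common:              # keep only mutual neighbors
            adj[v].discard(a)
        adj[a] = common
    return sorted(groups.values())

# a, b, c are pairwise compatible; d is compatible only with c.
result = clique_partition(list("abcd"),
                          [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
# -> [['a', 'b', 'c'], ['d']]
```

The triangle {a, b, c} collapses into one clique (one shared functional unit), while d, whose only compatible partner was already absorbed, ends up alone.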

Storage Unit Estimation

• Variables used in the behavior are mapped to storage units like registers or memory.
• Variables not used concurrently may be mapped to the same storage unit.
• Variables with non-overlapping lifetimes have an edge between their vertices.
• Lifetime analysis is popularly used in DSP synthesis in order to reduce the number of registers required.

71 72

Storage units such as registers are required for holding the data values represented by the constants, variables and arrays in the behavior. As with functional units, we want to reuse storage units in order to reduce the number of registers required. To do that, we can use the clique-partitioning technique or the lifetime analysis method.

71 72
Functional-unit and Interconnect-unit Estimation

• Clique-partitioning can be applied
• For determining the number of FUs required, construct a graph where
  – Each operation in the behavior is represented by a vertex
  – An edge connects two vertices if the corresponding operations are assigned different control steps and there exists an FU that can implement both operations
• For determining the number of interconnect units, construct a graph where
  – Each connection between two units is represented by a vertex
  – An edge connects two vertices if the corresponding connections are not used in the same control step

Software Estimation Model

• Processor-specific estimation model
  – Exact value of a metric is computed by compiling each behavior into the instruction set of the targeted processor using a specific compiler.
  – Estimation can be made accurately from the timing and size information reported.
  – The bad side is that it is hard to adapt an existing estimator for a new processor.
• Generic estimation model
  – Behavior will be mapped to some generic instructions first.
  – Processor-specific technology files will then be used to estimate the performance for the targeted processors.

73 74

Clique-partitioning can be applied to estimate the minimum number of functional units and registers required. The interconnection channels required for connecting the functional units can also be derived by this technique; here we assume the functional units share some common communication channels.
The advantage of the processor-specific model is that the estimates obtained are very accurate, since the behavior is actually compiled for execution on the selected processor. However, such an estimation approach requires a dedicated compiler for each processor. Consequently, it is difficult to adapt an existing estimator for a new processor. In addition, compiling a behavior for estimation purposes is computationally expensive.
Compared with the processor-specific model, the generic estimation model has several advantages. First, the generic estimation model does not require different compilers and estimators for different target processors; instead, only a single compiler, a single estimator and a set of technology files are required for software estimation. Second, the generic model makes retargeting the estimator to a new processor much easier: retargeting consists of providing a technology file for the new processor, whereas in the processor-specific model we would require a compiler for the new processor in addition to the timing and size information of the processor's instruction set. Finally, the peculiarities of each type of processor can be reflected in the technology file for that processor.
A disadvantage of the generic model is the lower accuracy of its estimates, largely because the generic instruction set represents only a small portion of a processor's entire instruction set.

Software estimation models

[Figure: software estimation flow, in which the behavior is compiled into generic instructions and the estimator combines them with processor technology files; not reproduced.]

Deriving processor technology files

Generic instruction: dmem3 = dmem1 + dmem2

    8086 instructions                    clocks    bytes
    mov ax, word ptr[bp+offset1]         (10)      3
    add ax, word ptr[bp+offset2]         (9+EA1)   4
    mov word ptr[bp+offset3], ax         (10)      3

    68020 instructions                   clocks    bytes
    mov a6@(offset1), d0                 (7)       2
    add a6@(offset2), d0                 (2+EA2)   2
    mov d0, a6@(offset3)                 (5)       2

    Technology file for 8086
    Generic instruction        Execution time    Size
    dmem3 = dmem1 + dmem2      35 clocks         10 bytes

    Technology file for 68020
    Generic instruction        Execution time    Size
    dmem3 = dmem1 + dmem2      22 clocks         6 bytes

75 76

The behavior is first compiled into a set of generic three-address instructions. Processor-specific technology files are made available, containing information about the number of clock cycles and bytes that each type of generic instruction requires. The estimator computes the software metrics for the behavior based on the generic instructions and the technology files for the target processors.
The proposed generic instruction set consists of the following five classes:
1. Arithmetic/logic/relational instructions: <des ← src1 op src2>
2. Move/load/store instructions: <des ← src>
3. Conditional jump instructions: <if cond goto label>
4. Unconditional jump instructions: <goto label>
5. Procedure call instructions: <call label>
The generic instruction given in the figure is compiled into a sequence of three 8086 instructions. Here, dmem indicates a direct memory addressing mode. The generic instruction is first mapped to a sequence of target-processor instructions; the total number of clock cycles for the generic instruction is obtained by summing the clock cycles of the instructions in the sequence. EA1 and EA2 in the figure are the effective-address calculation times used for the displacement memory addressing mode, which are six and eight clock cycles on the 8086 and 68020 processors, respectively. Thus, the generic instruction takes 35 and 22 clock cycles on the 8086 and 68020 processors, respectively. Using a similar approach, we can derive the number of bytes that each type of generic instruction will require when compiled for the 8086 or 68020. The total size of the three 8086 instructions is 10 bytes, which is entered in the technology file for the 8086 processor as the size of the generic instruction under consideration.
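Once the technology files exist, estimation reduces to table lookups and sums, as sketched below. The dictionary layout and the label "mem_to_mem_add" are our own; the clock and byte costs are the values derived in the figure for this one generic instruction.

```python
# Per-processor technology files: cost of each generic instruction.
TECH_8086 = {"mem_to_mem_add": {"clocks": 35, "bytes": 10}}
TECH_68020 = {"mem_to_mem_add": {"clocks": 22, "bytes": 6}}

def estimate(generic_instrs, tech_file):
    """Sum the per-instruction clock and byte costs recorded in the
    processor's technology file; returns (clocks, bytes)."""
    clocks = sum(tech_file[g]["clocks"] for g in generic_instrs)
    size = sum(tech_file[g]["bytes"] for g in generic_instrs)
    return clocks, size

prog = ["mem_to_mem_add", "mem_to_mem_add"]   # a 2-instruction behavior
cost_8086 = estimate(prog, TECH_8086)         # (70, 20)
cost_68020 = estimate(prog, TECH_68020)       # (44, 12)
```

Retargeting to a new processor means supplying one more technology-file dictionary; the estimator itself is unchanged.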

Software estimation

• Program execution time
  – Create basic blocks and compile into generic instructions
  – Estimate execution time of basic blocks
  – Perform probability-based flow analysis
  – Compute execution time of the entire behavior:

      exectime(B) = δ × ( Σ over bᵢ ∈ B of exectime(bᵢ) × freq(bᵢ) )   (1)

      δ accounts for compiler optimizations

• Program memory size

      progsize(B) = Σ over g ∈ G of instr_size(g)

• Data memory size

      datasize(B) = Σ over d ∈ D of datasize(d)

Refinement

• Refinement is used to reflect the conditions after partitioning, once the interface between HW/SW is built
  – Refinement is the update of the specification to reflect the mapping of variables.
• Functional objects are grouped and mapped to system components
  – Functional objects: variables, behaviors, and channels
  – System components: memories, chips or processors, and buses
• Specification refinement is very important
  – Makes specification consistent
  – Enables simulation of specification
  – Generates input for synthesis, compilation and verification tools
77 78

The program execution time can be estimated by first dividing the program into basic blocks, which consist of only straight-line statements. Next, probability-based flow analysis is applied to find the execution frequency of each basic block. The overall execution time is then the sum of the weighted execution times of the basic blocks; Equation (1) gives the average-case software execution time for the entire behavior.
In order to estimate the performance of optimized code, we need to know the optimization ratio of the compiler used by the designer to generate the machine instructions. The performance-optimization ratio δ is defined as the ratio of optimized-code performance to non-optimized-code performance.
Based on the size of each generic instruction, the program-memory size of a behavior is computed as the sum over the generic instructions in the behavior: if a behavior B is compiled into a set of generic instructions G, and instr_size(g) represents the size of the generic instruction g, then the size of the program required for behavior B is computed as progsize(B).
The data memory size is determined by examining the data declarations in the functional specification. The data memory size, datasize(d), of a declaration d is determined by the size of d's base type and the number of elements in d; the base type of any declaration is an indivisible type defined in the language. The data memory size of a behavior B is computed by summing that of each of its declarations. The data-memory estimation described here is accurate under the assumption that all variables have lifetimes equal to the execution time of the behavior, which means that memory locations are not shared between variables.
After partitioning the system behavior into several functional objects and mapping these objects onto suitable components, the original specification should be further refined to reflect the updated architecture. Refinement makes the specification consistent in all aspects, and it also makes the specification simulatable. In addition, the refined specification serves as an input for the verification, synthesis and compilation tools that may follow system design.
Here, we present the set of specification refinement tasks. First, we describe the refinement tasks associated with the implementation of variable and channel groups. Second, we introduce mechanisms for resolving access conflicts that may arise when several behaviors concurrently access variables or channels that have been grouped together into a memory or onto a bus. Third, we discuss the effect of binding behaviors on the pin structure and protocols used for communication; methods for interfacing two standard components with incompatible protocols are also presented. Finally, we discuss issues related to the implementation of communication between behaviors that have been assigned to hardware and software system components.
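The three software metrics above are direct summations; the block-level timings, instruction sizes and declarations below are invented for illustration.

```python
def exectime(basic_blocks, delta=1.0):
    """Equation (1): delta * sum(exectime(b_i) * freq(b_i)), where
    delta is the compiler's performance-optimization ratio and
    `basic_blocks` is a list of (exec_time, frequency) pairs."""
    return delta * sum(t * f for t, f in basic_blocks)

def progsize(generic_instrs, instr_size):
    """progsize(B): sum of the sizes of B's generic instructions."""
    return sum(instr_size[g] for g in generic_instrs)

def datasize(declarations):
    """datasize(B): sum over declarations of base-type size * element
    count, assuming no memory sharing between variables."""
    return sum(base * count for base, count in declarations)

# Three basic blocks, with optimized code running at 0.8x the
# unoptimized time (delta = 0.8):
t = exectime([(10, 1.0), (4, 10.0), (6, 5.0)], delta=0.8)   # 0.8*80 = 64.0
```

Note that δ < 1 models a compiler whose optimizations shorten the non-optimized schedule.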

Refining variable groups

• The memory to which a group of variables is mapped is reflected in the refined specification.
• Memory address translation
  – Assignment of addresses to each variable in the group
  – Update references to a variable by accesses to memory, e.g.

      V(63 downto 0) => MEM(163 downto 100)

    Original:                                 Refined:
      variable J, K : integer := 0;             variable J, K : integer := 0;
      variable V : IntArray (63 downto 0);      variable MEM : IntArray (255 downto 0);
      ....                                      ....
      V(K) := 3;                                MEM(K+100) := 3;
      X := V(36);                               X := MEM(136);
      V(J) := X;                                MEM(J+100) := X;
      ....                                      ....
      for J in 0 to 63 loop                     for J in 0 to 63 loop
        SUM := SUM + V(J);                        SUM := SUM + MEM(J+100);
      end loop;                                 end loop;
      ....                                      ....

Communication

• Shared-memory communication model
  – Persistent shared medium
  – Non-persistent shared medium
• Message-passing communication model
  – Channel
    • uni-directional
    • bi-directional
    • point-to-point
    • multi-way
  – Blocking
  – Non-blocking
• Standard interface scheme
  – Memory-mapped, serial port, parallel port, self-timed, synchronous, blocking
79 80
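The address translation shown on the slide, where array V is folded into MEM at base address 100, can be mimicked with a small sketch. The string-rewriting form is only illustrative of how references are updated; it is not how a real refinement tool works.

```python
# Sketch of the slide's memory address translation: V(63 downto 0) is
# mapped to MEM(163 downto 100), so every reference V(i) gains the base
# offset 100 that was assigned to V inside the shared memory.
BASE = 100  # base address of V within MEM, as on the slide

def translate_ref(index_expr):
    """Rewrite a reference V(index) into MEM(index + BASE)."""
    return f"MEM({index_expr} + {BASE})"

def fold(index):
    """Numeric form of the same translation for a constant index."""
    return index + BASE

print(translate_ref("K"))  # MEM(K + 100)
print(fold(36))            # 136, i.e. X := MEM(136)
```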
The first step in refining the original specification is to refine the variable
groups. The variables used in the behaviors have to be folded into memory
locations or registers. In addition, all memory references to the variables used in
the specification should also be translated into their real positions.

System partitioning maps a group of variables to an allocated memory with a
fixed number of words and width; different variables may have different sizes.
Variable folding refers to the assignment of the bits in a variable to the bits in
each word of memory.

The assignment of addresses to variables mapped to memory and the
modification of all references to those variables in the specification is known as
memory address translation. To assign memory addresses, one can consider the
variables in any order. The number of memory words required for each variable
can be determined by variable folding. Elements of an array variable must be
assigned contiguous addresses in the memory. For scalar variables, memory
address translation is relatively straightforward. For an array variable that has
been grouped with other variables, the addresses that index the variable must be
updated while refining references to that variable.

A system usually consists of several concurrent behaviors which need to
communicate some data or control information with each other. Usually this
communication can be conceptualized in either shared-memory or message-
passing paradigms. In the shared-memory communication model, the sending
behavior or process writes data to a shared medium, while the receiving process
can read the data from the medium. A shared medium can be either persistent or
non-persistent. A persistent medium, such as a storage element, retains the value
written into it by the process until that value has been overwritten by a new one.
A non-persistent medium, such as a wire connection, cannot retain the value.

In the message-passing model, the details of data transfer between processes are
replaced by communication over an abstract medium called a channel, over
which the data or messages are sent. A channel is a virtual entity free of any
implementation details. Only after synthesis would a channel be implemented in
the form of a bus consisting of a set of wires and protocols to sequence the data
transferred over these wires. There are many variations of the channel model,
including uni-directional, bi-directional, point-to-point, and multi-way. A further
distinction rests in whether the message-passing communication is blocking or
non-blocking. Blocking means that a process which communicates over a
channel will suspend until the other process is ready for the data transfer. The
main advantage of blocking is that two processes can synchronize at the data
transfer, and no additional storage is required. However, the drawback is that the
suspended process can cause performance deterioration. Non-blocking
communication does not have this problem; however, external storage must be
implicitly associated with the channel, usually in the form of a queue.
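The non-blocking case described above can be illustrated with a small Python sketch (purely illustrative, not POLIS code): the channel's implicit external storage is modeled by a queue, so the sender never suspends, while the receiver blocks until a message is available.

```python
import queue
import threading

# Non-blocking message passing: the channel carries implicit storage (a
# queue), so put() returns immediately; get() suspends the receiver until
# a message is ready.
channel = queue.Queue()  # unbounded queue models the channel's storage

def sender():
    for msg in [10, 20, 30]:
        channel.put(msg)  # never suspends the sending process

received = []
t = threading.Thread(target=sender)
t.start()
t.join()
while len(received) < 3:
    received.append(channel.get())  # blocks until a message arrives

print(received)  # [10, 20, 30]
```

A blocking (rendezvous) channel would instead synchronize sender and receiver at the transfer point and need no queue at all.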
Communication (cont)

(a) Shared memory M:
    Process P: begin variable x ... M := x; ... end
    Process Q: begin variable y ... y := M; ... end
(b) Channel C:
    Process P: begin variable x ... send(x); ... end
    Process Q: begin variable y ... receive(y); ... end

Inter-process communication paradigms: (a) shared memory, (b) message passing

Channel refinement
z Channels: virtual entities over which messages are transferred
z Bus: physical medium that implements groups of channels
z Bus consists of:
– wires representing data and control lines
– protocol defining sequence of assignments to data and control lines
z Two refinement tasks
– Bus generation: determining bus width
• number of data lines
– Protocol generation: specifying mechanism of transfer over bus
81 82
The slide shows the two conceptual communication models mentioned in the
previous slides. In the shared-memory communication model, two processes or
behaviors exchange information through some common shared memory. In the
message-passing model, the processes communicate with each other by sending
messages through a channel.

Concurrent behaviors in a specification communicate with one another by
sending messages over abstract communication channels. In order to minimize
the interconnect cost, the channels in the system are grouped in such a way that
each group of channels is implemented by a bus. To coordinate the use of the bus
by concurrent behaviors, some protocols have to be implemented with the bus.
The task of generating buses and their protocols for each group of channels is
called interface refinement. Therefore, the refinement of a channel includes two
tasks: generation of a bus with the proper width, and generation of the associated
protocol.

A simple case of bus-width generation, in which all the channels in a group have
an identical message size, is presented. In this case, channels are merged in such
a way that all channels in any group are used exclusively over time to
communicate between the same two behaviors. Consequently, each channel
group is implemented with a bus width identical to the size of any channel. In the
more general case, behaviors communicating over channels that have been
grouped together may want to transfer data over the shared physical medium
simultaneously. In addition, different channels may be transferring messages of
different sizes between the behaviors.
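The simple bus-width generation case above can be sketched with two hypothetical helpers (not from the notes): with uniform message sizes the bus width equals the channel size, while a wider message needs multiple transfers over the chosen bus.

```python
from math import ceil

def bus_width_uniform(channel_bits):
    """Simple case: all grouped channels carry messages of one size."""
    sizes = set(channel_bits)
    assert len(sizes) == 1, "uniform-size case only"
    return sizes.pop()

def transfers_needed(message_bits, bus_bits):
    """General case: number of bus transfers needed for one message."""
    return ceil(message_bits / bus_bits)

print(bus_width_uniform([8, 8, 8]))  # 8
print(transfers_needed(16, 8))       # 2, as in the CH0 example later
```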
Protocol generation
z Protocol selection
– full handshake, half-handshake etc.
z ID assignment
– N channels require log2(N) ID lines
z Bus structure and procedure definition
z Update variable references
z Generate processes for variables

Protocol generation example

  type HandShakeBus is record
    START, DONE : bit ;
    ID : bit_vector(1 downto 0) ;
    DATA : bit_vector(7 downto 0) ;
  end record ;

  signal B : HandShakeBus bus ;

  procedure ReceiveCH0( rxdata : out bit_vector) is
  begin
    for J in 1 to 2 loop
      wait until (B.START = '1') and (B.ID = "00") ;
      rxdata(8*J-1 downto 8*(J-1)) <= B.DATA ;
      B.DONE <= '1' ;
      wait until (B.START = '0') ;
      B.DONE <= '0' ;
    end loop;
  end ReceiveCH0;

  procedure SendCH0( txdata : in bit_vector) is
  begin
    B.ID <= "00" ;
    for J in 1 to 2 loop
      B.DATA <= txdata(8*J-1 downto 8*(J-1)) ;
      B.START <= '1' ;
      wait until (B.DONE = '1') ;
      B.START <= '0' ;
      wait until (B.DONE = '0') ;
    end loop;
  end SendCH0;
83 84
Various access protocols to the bus can be used, such as full-handshake, half-
handshake, or hardwired ports. Different protocols will require different control
lines. The behavior that will use the bus should be assigned a unique ID to
differentiate the user of the bus at any time. If N channels are implemented on
the same bus, log2(N) lines will be required to encode the channel ID. A unique
ID is assigned to each channel. The structure determined for the bus is defined in
the specification. For each channel mapped to the bus, appropriate send and
receive procedures are generated, encapsulating the sequence of assignments to
the bus control, data, and ID lines to execute the data transfer. References to a
variable that has been assigned to another system component by system
partitioning must be updated in behaviors that were originally referencing it
directly. Accesses to variables are replaced by the send/receive procedure calls
corresponding to the channel over which the variable is accessed. In order to
obtain a simulatable system specification, a separate behavior is created for each
group of variables accessed over a channel. Appropriate send and receive
procedure calls are included in the behavior to respond to access requests to the
variable over the bus.

This slide shows the declaration of an eight-bit bus, with two control lines and
two ID lines. Bus B is declared to be a global variable so that all behaviors can
access it. Behavior P writes to the 16-bit variable X over channel CH0. Since the
bus width is only eight bits, procedures SendCH0 and ReceiveCH0 transfer the
16-bit message associated with channel CH0 over the bus in two transfers of
eight bits each.
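The essence of SendCH0/ReceiveCH0, a 16-bit message moved over an 8-bit bus in two transfers, can be checked with a short Python sketch; the handshake itself is omitted, and only the byte slicing of the VHDL example is mirrored.

```python
def send_ch0(value16):
    """Split a 16-bit value into two 8-bit bus words, low byte first,
    mirroring txdata(8*J-1 downto 8*(J-1)) for J = 1, 2."""
    return [(value16 >> (8 * (j - 1))) & 0xFF for j in (1, 2)]

def receive_ch0(words):
    """Reassemble the 16-bit value from the two 8-bit transfers."""
    value = 0
    for j, w in enumerate(words, start=1):
        value |= w << (8 * (j - 1))
    return value

assert send_ch0(0xABCD) == [0xCD, 0xAB]          # two 8-bit transfers
assert receive_ch0(send_ch0(0xABCD)) == 0xABCD   # round-trip is lossless
```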
Refined specification after protocol generation

Arbitration schemes
z Arbitration schemes determine the priorities of a group of behaviors' accesses in order to resolve access conflicts.
z A fixed-priority scheme statically assigns a priority to each behavior, and the relative priorities of all behaviors are not changed throughout the system's lifetime.
– Fixed priority can also be pre-emptive.
– It may lead to a higher mean waiting time.
z A dynamic-priority scheme determines the priority of a behavior at run-time.
– Round-robin
– First-come-first-served
85 86
In the previous figure, behavior P writes the value "32" directly to the variable X.
Channel CH0 represents the write to variable X. The statement "X <= 32" is
replaced by the send procedure call "SendCH0(32)" in this slide. The statement
"MEM(60) := COUNT" in behavior Q is updated to "SendCH3(60, COUNT)",
indicating that the value in COUNT is to be written to address 60 of array MEM.
In the previous figure, the variables X and MEM were assigned to different
system components, as shown by dashed lines. In this slide, behaviors Xproc and
MEMproc have been created for these two variables.

Protocol generation has several advantages. First, the refined specification is
simulatable, and the design functionality, after insertion of buses and
communication protocols, can be verified. Second, by encapsulating data
transfers over the bus in terms of send and receive procedures, the description of
the behavior remains less cluttered than it would be if we were to insert the
assignments to the control and data lines at each communication point in the
behavior. Finally, if at a later stage another communication protocol were
selected for communication over the bus, only the bus declaration and the send
and receive procedures would need to be changed. The system's behavior
descriptions, including the send and receive procedure calls, remain unchanged.

The accesses by a set of behaviors to a shared resource must be prioritized.
Given a group of behaviors needing to access a given resource, an arbitration
scheme determines their relative priority in order to resolve potential access
conflicts. Arbitration schemes may be classified as fixed-priority and dynamic-
priority.

Determining the fixed priority for various behaviors depends on some metric
which has to be optimized, for example the mean waiting time, the size of the
data associated with accesses made by a behavior, or the frequency of accesses
made by the behavior to the shared resource. The mean waiting time represents
the average time spent by any behavior waiting to gain access to a shared
resource.

Behaviors should be assigned priorities in a manner that minimizes the mean
waiting time. For a specific assignment of priorities to behaviors, the resulting
mean waiting time can only be determined dynamically by simulating the
specification. To determine priorities of the behaviors statically, the mean
waiting time is usually approximated by metrics that can be evaluated easily.
One such metric is the size of the data associated with accesses made by a
behavior.

A dynamic-priority scheme determines the priority of a behavior according to
the state of the system at run-time. For example, a round-robin scheme assigns
the lowest priority to the behavior that most recently accessed the shared
resource. A first-come-first-served scheme grants access privileges to behaviors
in the order in which they requested access. Such schemes are characterized by
the absence of any absolute order in which behaviors are granted access to a
resource. Consequently, with dynamic arbitration a behavior will not have to
wait indefinitely to gain access to the shared resource.
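The two scheme families above can be sketched as simple grant functions. This is purely illustrative; the behavior names and priority values are hypothetical.

```python
def fixed_priority_grant(requests, priority):
    """Grant the requester with the best (lowest) static priority number."""
    return min(requests, key=lambda b: priority[b]) if requests else None

def round_robin_grant(requests, order, last):
    """Grant the next requester after the most recently served behavior."""
    start = (order.index(last) + 1) % len(order) if last in order else 0
    for i in range(len(order)):
        b = order[(start + i) % len(order)]
        if b in requests:
            return b
    return None

priority = {"B1": 0, "B2": 1, "B3": 2}  # hypothetical static assignment
assert fixed_priority_grant({"B2", "B3"}, priority) == "B2"
# B1 was served last, so it now has the lowest round-robin priority:
assert round_robin_grant({"B1", "B3"}, ["B1", "B2", "B3"], last="B1") == "B3"
```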
Tasks of hardware/software interfacing
z Data access (e.g., behavior accessing variable) refinement
z Control access (e.g., behavior starting behavior) refinement
z Select bus to satisfy data transfer rate and reduce interfacing cost
z Interface software/hardware components to standard buses
z Schedule software behaviors to satisfy data input/output rate
z Distribute variables to reduce ASIC cost and satisfy performance

Hardware/Software interface refinement
(a) Partitioned Specification (b) Mapping to Architecture
87 88
Any behavioral description can be implemented either as software or as
hardware. If we choose to implement it as software, the behavioral description
will be compiled into the instruction set of the chosen processor. If we choose to
implement it as hardware, the behavior has to be synthesized into ASIC
hardware. Suppose some behaviors in the system are implemented as SW and
some as HW; we then have to refine the interface between the hardware and
software behaviors. The tasks related to HW/SW interfacing are summarized in
this slide. The interface between two behaviors includes data and control
transfer. Therefore, we have to refine the data accesses for the shared variables,
and then refine the control accesses that coordinate the execution sequence
between behaviors. Next, an appropriate bus width, protocol, and architecture
should be chosen to satisfy the required data transfer rate. (AMBA is a popular
standard bus used in many SoC systems.) After the bus structure has been
decided, the software and hardware behaviors have to be modified to be
compatible with the bus standard. In many cases, the software behaviors can be
properly scheduled to meet the data input and output rates. Finally, since some
variables are shared between behaviors, and the allocation of variables to either
memory or ASIC buffers may affect the overall performance, we have to pay
attention to how these variables are distributed.

This slide shows an example of the refinement of a hardware/software interface.
Figure (a) shows an example of a partitioned specification. In this specification,
v1 to v6 represent variables, B1 to B4 are behaviors, and p1 to p3 are ports
through which the behaviors communicate data with the outside. The edges in
the figure represent data accesses by the behaviors to the variables or ports. A
behavior like B1 or B2 that is mapped to the software partition is called a
software behavior; similarly, one mapped to the hardware partition, such as B3
or B4, is called a hardware behavior. Variables in the behaviors can be
categorized as hardware variables or software variables, and they can reside in
different places. In this example, variable v1 is shared by behaviors in the same
partition, and it can be handled simply by software compilation. The hardware
variable v6 resides in the ASIC. Variables v2 and v4, shared by behaviors in
different partitions, can be allocated to the ASIC or to memory. In this example,
v2 resides in the main memory while v4 is implemented in the hardware buffer.
The hardware ASIC, the software processor, and the memory all communicate
through a bus over which the data accesses take place.
Data and control access refinement
z Four types of data access in the HW/SW interface:
– Software behaviors access memory locations.
– Hardware behaviors access memory locations.
– Software behaviors access ports of the ASIC's buffer.
– Hardware behaviors access the ASIC's buffer.
z Control access refinement's tasks:
– Insert the corresponding communication protocols in the software and the hardware behaviors.
– Insert any necessary software behaviors such as interrupt service routines.
– Refine the accesses to any shared variables that have been introduced by the insertion of the protocols.

Contents
z Function-Architecture Codesign
– Design Representation
– Optimizations
– Cosynthesis and Estimation
z Case Study
– ATM Virtual Private Network Server
– Digital Camera SoC
89 90
This slide summarizes the tasks of data and control access between behaviors.
There are four basic types of data access that may occur across the HW/SW
interface. First, software behaviors may access software variables which reside
in memory. This type of data access is usually carried out through the
processor's load/store instructions. Second, the ASIC hardware IP may access
data in memory. This type of data access is carried out through a direct-memory-
access (DMA) mechanism; such accesses are similar to load and store
operations, except that the ASIC first has to gain control of the bus. The third
possibility is that software behaviors access ports of the ASIC's buffer. In this
scenario, the data access can be carried out either through the processor's in/out
or move instructions, or through a memory-mapped I/O mechanism. Finally, the
hardware behaviors may also access the ASIC's buffer. This data access is
usually realized during the implementation of the ASIC.

Unlike a data access channel, which exists between a behavior and a variable, a
control channel exists between two behaviors to indicate the starting or
completion of a behavior. This control can be built by inserting handshaking
protocols which use shared variables (start and done) to indicate the starting and
completion points. However, if the behaviors are implemented as software, then
in order to improve processor efficiency an interrupt mechanism is usually
incorporated, so that the processor can execute other behaviors while waiting for
the completion of some behavior implemented in HW.

In this part of the codesign course, we will mainly cover a well-known
methodology called Function-Architecture Codesign (FAC) from UC Berkeley.
FAC stresses different phases of architecture-independent and architecture-
dependent models and model-oriented optimizations, most of which are from
conventional data-flow analysis. FAC is wider in scope than conventional
hardware-software codesign in that the final architecture of the system is taken
into consideration during codesign, hence the name function-architecture
codesign.

Two case studies will be presented to highlight the features of the methodology,
including IP re-use and SoC design. The ATM VPN example is from CSELT,
Italy, and the digital camera SoC is from Chapter 7 of Frank Vahid's book on
Embedded System Design.
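The start/done control handshake described above can be sketched as follows. This is a hypothetical toy model, not generated interface code; only the shared-flag naming (start, done) follows the notes.

```python
# Hypothetical sketch of the start/done control handshake: shared flags
# coordinate a software caller with a hardware behavior.
class HwBehavior:
    def __init__(self, func):
        self.func = func
        self.start = False  # raised by the SW side to request execution
        self.done = False   # raised by the HW side on completion

    def step(self, arg):
        """One activation of the HW side: run once start is observed."""
        if self.start and not self.done:
            result = self.func(arg)
            self.done = True  # completion signalled back to the SW side
            return result
        return None

hw = HwBehavior(lambda x: x * 2)
hw.start = True    # SW raises start; it would then poll done, or take an
out = hw.step(21)  # interrupt instead of busy-waiting
print(hw.done, out)  # True 42
```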
Function Architecture Codesign (FAC) Methodology
z FAC is a top-down (synthesis) methodology
z More realistic than hardware-software codesign
– More suitable for SoC design
z Maps function to architecture
– Application functions
– SoC Target Architecture
z Trade-offs between hardware and software implementations

Function Architecture Codesign (FAC) Methodology
(Figure: Function and Architecture meet in a Trade-off; Synthesis proceeds by Refinement and Verification by Abstraction through a Mapping step; HW and SW likewise meet in a Trade-off.)
91 92
Function Architecture Codesign (FAC) is a top-down design methodology for
hardware-software systems. FAC was initially proposed in the following book,
which was also the Ph.D. dissertation of Dr. Bassam Tabbara:
Function/Architecture Optimization and Co-Design for Embedded Systems,
Bassam Tabbara, Abdallah Tabbara, and Alberto Sangiovanni-Vincentelli,
Kluwer Academic Publishers, 2000.

Compared to conventional hardware-software codesign, the FAC methodology
is more suitable for SoC hardware-software codesign because of the complexity
of an SoC. An SoC is not merely hardware plus software, which is what a
conventional hardware-software codesign methodology would assume. Rather,
an SoC is a complex integration of hardware IPs and software modules, which
requires a more detailed mapping of application functionalities to affordable
hardware components and to efficient software components. The FAC
methodology is a good solution to this mapping problem since it maps functions
to architecture. Hence, FAC is more suitable for SoC design.

FAC is basically a top-down synthesis methodology based on a set of different
models that are differentiated into architecture-independent and architecture-
dependent models. FAC mainly requires two inputs: (1) a functional description
of the application under development, and (2) a description of the target system
architecture. For an SoC, the inputs will be a functional description of the SoC
application and an architectural description of the to-be-designed SoC. For
example, the FAC-based design of a Digital Camera (DigCam) SoC will require
(1) a

This slide shows how functions and architectures must be mapped to generate
the final architecture through an iterative refinement process and a verification
process, while at the same time making trade-offs between HW and SW. There
are basically two steps:
(1) Trade-off between function and architecture: through an estimation of the
architecture, one is able to make some trade-offs between function and
architecture. For example, if the computation required by a function is more
than what a selected processor in an architecture can handle, then it is required
either to (a) upgrade the CPU, for example use a more powerful CPU, or (b)
downgrade the functional requirements, for example use a simpler computation.
(2) Trade-off between hardware and software: when functions are mapped to
concrete components, either hardware or software, profiling can be done to
estimate performance. If either performance or cost does not satisfy the user-
given constraints, then trade-offs are required so that the constraints are met.

The above two steps constitute the refinement (synthesis) process in FAC. The
reverse direction is verification, which requires abstraction of implemented
systems. For example, a hardware-software implementation of a DigCam SoC
needs to be abstracted into a high-level model to check/verify that it satisfies
user-given system-level functional or non-functional requirements. An iteration
of the above two directions, namely synthesis and verification, will result in an
optimized target system.
FAC System Level Design Vision
(Figure: Function "casts a shadow" downward through Refinement, Architecture "sheds light" upward through Abstraction, and the two meet in Constrained Optimization and Co-design.)

Platform-Based Design
(Figure: in the Application Space, Application Instances map down to a Platform Specification; the System Platform supports Platform Design-Space Exploration down to a Platform Instance in the Architectural Space.)
Source: ASV
93 94
This shows how functions can be implemented on an architecture through
constrained optimization and codesign. It appears as if the function light (torch)
is reflected and constrained by the architecture tub. A mutually constraining
relationship exists between function and architecture, as described in the
following:
(1) Some functions are constrained by the architecture due to either time or
space limitations. For example, due to memory restrictions, some functions
might have to use fixed-point computations, which will result in quality
degradation of the final computation results.
(2) A given architecture will also provide some functionalities which are not
required by a given functional description of a system. This part of the
architecture should be minimized as much as possible for a cost-optimized final
system.
Through the FAC methodology, one is able to consider this mutually
constraining relationship and obtain an optimized system, which is not
something a conventional hardware-software codesign methodology could
possibly handle.

From ASV: Alberto Sangiovanni-Vincentelli's slides. We first map different
kinds of application requirements onto a single platform, which is then used to
derive a platform instance that represents the target system design. A two-phase
process is the general methodology used in platform-based design:
(1) An application is first mapped to a platform, for example a design reference
board.
(2) The platform-based design is then transformed into a real design which maps
the system to an architecture.
Since the platform plays an important role in the design process, one has to be
careful not to be trapped by the platform. That is, we have to be sure of what
parts of the platform are actually used and what parts are actually required,
because otherwise we would end up with a system that uses more than what our
cost estimates considered; an infeasible system would be the final result in that
case. Thus, the platform brings convenience to a designer, but at the same time
the flexibility provided by a platform may be a cause of under-estimation.

FAC is like a platform-based design process, but it does not have the above
shortcoming, because its design process is a direct mapping to the final
architecture without going through an "intermediate" platform stage.
Main Concepts
z Decomposition
z Abstraction and successive refinement
z Target architectural exploration and estimation

Decomposition
z Top-down flow
z Find an optimal match between the application function and the architectural/application constraints (size, power, and performance).
z Use a separation-of-concerns approach to decompose a function into architectural units.
95 96
In FAC, the main concepts are decomposition, refinement, and exploration with
estimation.
(1) Decomposition: a complex application must be decomposed into modular
functionalities, or functionalities that can be implemented by some hardware IP
or software COTS. A clever decomposition is also a key step in a successful
FAC mapping.
(2) Abstraction and successive refinement: the FAC design process is an
iteration of abstraction and refinement, because a system is designed by
mapping the abstraction of what an architecture can provide onto the refinement
of the functionalities in an application.
(3) Target architectural exploration and estimation: a target architecture can still
possess several flexibilities, such as the amount of memory, which need to be
explored and estimated for the final design results.

Decomposition is a top-down flow involving an optimal match between the
application function and the architectural constraints. We need to use a
separation-of-concerns approach to decompose a function into architectural
units. For example, some typical considerations are as follows:
(1) Computation: consider only the computation required by a function and
whether the computation power provided by an architectural unit, such as a
processor, can match the requirements.
(2) Communication: consider the overall performance required by a function,
such as response time and throughput, and whether that provided by a chosen
interconnection architecture, such as an AMBA bus or a mesh NoC (network-
on-chip), can match the requirements.
Abstraction & Successive Refinement
z Function/Architecture formal trade-off is applied for mapping function onto architecture
z Co-design and trade-off evaluation from the highest level down to the lower levels
z Successive refinement to add details to the earlier abstraction level

Target Architectural Exploration and Estimation
z Synthesized target architecture is analyzed and estimated
z Architecture constraints are derived
z An adequate model of target architecture is built
97 98
Refinement is the process of adding details to the currently developed or
designed architectural components. A design is developed using an iterative
process, which consists of architectural abstractions and function refinements.
These two actions must meet somewhere; that is, there is a trade-off to be
achieved such that a feasible system satisfying all user-given constraints is
constructed. A system always consists of parts that are abstract and parts that are
already refined. The abstract parts need to be refined and bound to some
architectural parts. A match is to be found between (1) the functional parts that
need to be "refined," and (2) the architectural parts whose functionalities can be
"abstracted."

The target architecture is synthesized through a series of estimation, analysis,
and refinement steps. Traditional design-space exploration techniques can be
used here. The choices among the types and the number of each component are
made at this stage. For example, the bandwidth of a bus channel, the number of
ports, the interconnection capabilities of a network interface, the amount of each
type of memory module, and so on are all considered at this stage. The goal of
this stage is to find an optimal architecture within the user-given bounds.
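The exploration step above can be sketched as an exhaustive search over a tiny design space. All component choices, the cost model, and the throughput model here are hypothetical; a real flow would use estimation tools instead of closed-form formulas.

```python
from itertools import product

bus_widths = [8, 16, 32]  # candidate numbers of data lines
mem_kb = [16, 32, 64]     # candidate memory amounts (KB)

def cost(bw, mem):        # hypothetical cost model
    return 2 * bw + mem

def throughput(bw, mem):  # hypothetical performance model
    return 10 * bw

# Pick the cheapest configuration meeting a throughput bound of 160.
feasible = [((bw, mem), cost(bw, mem))
            for bw, mem in product(bus_widths, mem_kb)
            if throughput(bw, mem) >= 160]
best, best_cost = min(feasible, key=lambda p: p[1])
print(best, best_cost)  # (16, 16) 48
```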
Architectural Exploration in POLIS
(Figure: control-dominated HW/SW design flow.)

Reactive System Cosynthesis
(Figure: a design is decomposed into an EFSM Representation, which is mapped to a CDFG Representation, which is in turn mapped to hardware and software.)
EFSM: Extended Finite State Machines
CDFG: Control Data Flow directed acyclic Graph
99 100
This shows how FAC architectural explorations are done in POLIS. There are
three levels: functional, mapping, and architectural.
(1) Functional level: users specify system requirements at the functional level by
describing two types of models, a functional model and an architecture model.
Each of the models is then verified for correctness.
(2) Mapping level: the functional model is then mapped to the architecture
model using a data-flow analysis framework. The mapping is then verified
against the performance requirements.
(3) Architectural level: the mapping is then refined into a hardware-software
micro-architecture, which can then be optimized and verified at the micro-
architecture level.

A reactive system can be co-synthesized by decomposing the design
requirements into a set of EFSMs (Extended Finite State Machines), which are
then mapped to a set of CDFGs (Control Data Flow Graphs) by scheduling and
estimation. These CDFGs are then synthesized into hardware and software
modules after partitioning and estimation. The EFSM is more suitable for
describing the functionalities of a control-dominated design application; the
CDFG is more suitable for implementation into hardware and software. An
example will be given in later slides.
Reactive System Cosynthesis
(Figure: an EFSM with states S0, S1, and S2, where S0 performs a := 5, S1 performs a := a + 1, and S2 performs emit(a), is mapped to a CDFG: BEGIN; case (state) of S0: a := 5, state := S1; S1: a := a + 1, state := S2; S2: emit(a); END.)
The CDFG is suitable for describing reactive behavior, but some of the control flow is hidden and data cannot be propagated.

Data Flow Optimization
(Figure: the optimized EFSM representation after value propagation: S0 performs a := 5, S1 performs a := 6, and S2 performs emit(a).)
101 102
• The transition function is suitable for describing an EFSM:
• it captures the reactive behavior of hardware or software, and it breaks across re-invocations;
• data is "abstracted" for easier verification;
• one can leverage hardware synthesis flows, and for software it makes ESTIMATION easier;
• but it is not suitable for performing data-flow optimization, so a different view is needed.
The slide shows an example where the EFSM consists of three states, S0, S1,
and S2. The corresponding CDFG is shown on the right part of the slide. There
are three cases based on the three states, and the corresponding assignments are
shown on the different paths in the CDFG. The CDFG is more suitable for
expressing reactive behavior, but some control flow is hidden and data cannot be
propagated.

An EFSM representation can be optimized by term rewriting and value
propagation, as shown in the slide. The data value of variable "a" can be
propagated from S0 to S1. Since "a" is not changed between S0 and S1, its value
can be directly set to 6 in S1 and then emitted in S2. This saves an adder for
incrementing the variable value.
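The value propagation described above can be sketched as constant folding along a linear path of EFSM assignments. This is a toy model, not the POLIS algorithm; the action encoding is invented for illustration.

```python
def propagate(path):
    """Fold known constants through a list of (var, op, operand) actions."""
    env, optimized = {}, []
    for var, op, operand in path:
        if op == "const":
            env[var] = operand
            optimized.append((var, "const", operand))
        elif op == "add" and var in env:
            env[var] += operand                         # value is known, so
            optimized.append((var, "const", env[var]))  # fold the addition
        else:
            env.pop(var, None)  # unknown update invalidates the known value
            optimized.append((var, op, operand))
    return optimized

# S0: a := 5, then S1: a := a + 1  ==>  S1 becomes a := 6 (no adder needed)
assert propagate([("a", "const", 5), ("a", "add", 1)])[-1] == ("a", "const", 6)
```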
Abstract Codesign Flow
(Figure: Design -> Decomposition -> Intermediate Design Representation (IDR) -> Function/Architecture Optimization and Codesign -> Mapping -> Hardware/Software Cosynthesis.)

Function/Architecture Codesign
(Figure: an application with FSM and data-flow functions, I/O, control, and data is mapped onto ASICs and processors; the result is a SW partition (an RTOS on a processor) and a HW partition (HW1 to HW5) connected through interfaces and a bus.)
103 104
We will discuss the design representations used in FAC in this section of the
slides.

An Intermediate Design Representation (IDR) must be used to allow different
types of inputs and to support functional mapping into a single final
architecture. The IDR must at the same time be unbiased towards either
hardware or software. The abstract codesign flow will be made concrete in
future slides.
103 104
Unifying Intermediate Design Representation for Codesign

[Figure: Design → Functional Decomposition → IDR (Intermediate Design
Representation). Architecture-independent steps operate on the IDR; then
architecture constraints drive the architecture-dependent steps that yield
the SW and HW parts.]

Models (in POLIS / FAC)

• Function Model
  – Esterel
    • as "front-end" for functional specification
    • Synchronous programming language for specifying reactive real-time
      systems
  – Reactive VHDL
  – EFSM (Extended FSM): support for data handling and asynchronous
    communication
• Architecture Model
  – SHIFT: Network of CFSMs (Codesign FSMs)

105 106

After functional decomposition, the IDR is used to represent an
architecture-independent model of the system under development.
Architecture-independent optimizations are performed on the IDR. Then,
architectural constraints are considered for architecture-dependent
optimizations, giving SW and HW partitions that can be further synthesized
and integrated. More on optimizations will be discussed in later slides.

The POLIS system is an implementation of FAC. The function model can be
composed from Esterel language specifications, reactive VHDL, and a set of
EFSMs. Esterel is a synchronous language that has been used heavily in
Europe for the design of hardware and reactive systems. It is well supported
by many academic as well as commercial tools. The architecture model is
composed from a network of CFSMs (Codesign FSMs), called SHIFT
(Software-Hardware Intermediate FormaT). SHIFT will be discussed in more
detail later.
CFSM

• Includes
  – Finite state machine
  – Data computation
  – Locally synchronous behavior
  – Globally asynchronous behavior
• Semantics: GALS (Globally Asynchronous, Locally Synchronous communication
  model)

CFSM Network MOC

[Figure: three CFSMs exchanging events: CFSM1 and CFSM2 react to event C
(B=>C, C=>F, C=>G, C=>A, C=>B), and CFSM3 reacts to F^(G==1) and
(A==0)=>B, illustrating communication between CFSMs by means of events.]

MOC: Model of Computation

107 108

A CFSM is basically a finite state machine along with data computations and
the GALS semantics. GALS is an important notion in developing complex
architectures such as distributed real-time embedded systems, including
sensor networks. GALS was proposed long ago, but its actual usefulness is
only now appreciated with the advent of complex systems such as
multiprocessor SoCs, distributed sensor networks, and pervasive systems. The
two-level synchronization approach makes the design of such systems easier
and more precise.

A network of CFSMs can collaborate as a model of computation along with
communication events. The GALS concept is well implemented in this
architecture. For example, the system-level channels are basically
asynchronous, yet each CFSM is locally synchronous, that is, it is driven by
some clock. Events such as F, G, C, A, and B are used for communication
among the different CFSM processes.
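The GALS semantics described above can be illustrated with two locally synchronous machines exchanging events through asynchronous buffers. The class names, the queue-based event delivery, and the two-machine setup are illustrative assumptions, not the CFSM formalism itself:

```python
# A small sketch of GALS: each CFSM runs a synchronous step driven by its
# own "clock", while CFSMs communicate through asynchronous event queues.
from collections import deque

class CFSM:
    def __init__(self, name, react):
        self.name, self.react = name, react
        self.inbox = deque()            # asynchronous event buffer

    def emit(self, target, event):
        target.inbox.append(event)      # non-blocking event delivery

    def step(self):
        """One locally synchronous step: consume at most one pending event."""
        event = self.inbox.popleft() if self.inbox else None
        self.react(self, event)

log = []

def producer_react(self, event):
    self.emit(consumer, "C")            # emits event C on every local tick

def consumer_react(self, event):
    if event == "C":
        log.append("got C")             # reacts only when C is present

producer = CFSM("CFSM1", producer_react)
consumer = CFSM("CFSM2", consumer_react)

# The two machines may be clocked at different rates (globally asynchronous):
for tick in range(4):
    producer.step()
    if tick % 2 == 0:                   # consumer runs at half the rate
        consumer.step()
print(log)                              # the consumer saw two C events
```

Because the event buffer decouples the two clocks, neither machine blocks the other; unconsumed events simply wait in the queue.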
Intermediate Design Representation (IDR)

• Most current optimization and synthesis are performed at the low
  abstraction level of a DAG (Directed Acyclic Graph).
• The Function Flow Graph (FFG) is an IDR having the notion of I/O
  semantics.
• The textual interchange format of the FFG is called the C-Like
  Intermediate Format (CLIF).
• An FFG is generated from an EFSM description and can be in a Tree Form or
  a DAG Form.

Function Flow Graph (Architecture)

[Figure: Design → Functional Decomposition → FFG (architecture-independent,
I/O semantics) → architecture constraints → AFFG (architecture-dependent,
EFSM semantics) → SW and HW, with refinement and restriction steps
connecting the levels.]

109 110

The IDR in FAC is a Function Flow Graph (FFG), which has I/O semantics and
can be textually written in a C-Like Intermediate Format (CLIF). The FFG or
CLIF can be generated from an EFSM description. There are two forms of FFG,
namely the tree form and the DAG form. The FFG is a unified model that is
unbiased towards either hardware or software. Further, the FFG can be
optimized in several ways using conventional software analysis approaches.
The FFG can also be annotated with architectural attributes for
implementation. Finally, the attributed FFG is optimized and then mapped to
either hardware circuits or software code automatically.

The Attributed or Architecture FFG (AFFG) is the annotated version of an
FFG, derived by adding architecture attributes to the original FFG.
Attributes may be derived from hardware or software libraries and by
profiling. Refinements and optimizations are two-fold: architecture-
independent ones that work on the FFG, and architecture-dependent ones that
result in the AFFG and its mapping to SW and HW. At the architecture-
independent phase, I/O semantics is used for optimizations, while at the
architecture-dependent phase EFSM semantics is used.
FFG/CLIF

• Develop the Function Flow Graph (FFG) / C-Like Intermediate Format (CLIF)
  • Able to capture an EFSM
  • Suitable for control and data flow analysis

[Figure: EFSM → FFG → (data flow / control optimizations) → Optimized FFG →
CDFG.]

Function Flow Graph (FFG)

• An FFG is a triple G = (V, E, N0) where
  • V is a finite set of nodes
  • E = {(x, y)}, a subset of V×V; (x, y) is an edge from x to y where
    x ∈ Pred(y), the set of predecessor nodes of y
  • N0 ∈ V is the start node, corresponding to the EFSM initial state
  • An unordered set of operations is associated with each node N
  • Operations consist of TESTs performed on the EFSM inputs and internal
    variables, and ASSIGNs of computations on the input alphabet
    (inputs/internal variables) to the EFSM output alphabet (outputs and
    internal (state) variables)

111 112

An EFSM is captured in an FFG (which has I/O semantics), and then data
flow/control optimizations are performed to obtain an optimized FFG. The
CDFG is the final result, mapped from the optimized FFG.

This is the formal definition of the FFG. It is a directed acyclic graph
(DAG) with the following features:
(1) V is a finite set of nodes that represent some sequence of computations;
(2) E is a set of edges that represent predecessor/successor relationships
among the nodes, that is, the precedence between two sequences of
computations;
(3) N0 is the initial node;
(4) each node is associated with a set of unordered operations;
(5) operations consist of TEST and ASSIGN. TEST checks the value of EFSM
inputs and internal variables. ASSIGN defines the value of the EFSM output
alphabet after some computations on the EFSM input alphabet.
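The triple definition above maps directly onto a small data structure, with Pred(y) derived from E. The field and method names below are illustrative, not taken from POLIS:

```python
# Direct encoding of the FFG triple G = (V, E, N0) from the definition above.
from dataclasses import dataclass, field

@dataclass
class FFG:
    V: set                                    # finite set of nodes
    E: set                                    # edges (x, y) with x in Pred(y)
    N0: str                                   # start node (EFSM initial state)
    ops: dict = field(default_factory=dict)   # node -> unordered TEST/ASSIGN ops

    def pred(self, y):
        """Pred(y): all nodes with an edge into y."""
        return {x for (x, yy) in self.E if yy == y}

# The three-state CLIF example used later (S1 -> S2 -> S3, S3 loops on itself)
g = FFG(V={"S1", "S2", "S3"},
        E={("S1", "S2"), ("S2", "S3"), ("S3", "S3")},
        N0="S1")
print(g.pred("S3"))   # predecessors of S3 include S3 itself via the loop
```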
C-Like Intermediate Format (CLIF)

• Import/Export Function Flow Graph (FFG)
• "Un-ordered" list of TEST and ASSIGN operations
  – [if (condition)] goto label
  – dest = op(src)
    • op = {not, minus, …}
  – dest = src1 op src2
    • op = {+, *, /, ||, &&, |, &, …}
  – dest = func(arg1, arg2, …)

Preserving I/O Semantics

input inp;
output outp;
int a = 0;
int CONST_0 = 0;
int T11 = 0;
int T13 = 0;

S1:
    goto S2;
S2:
    a = inp;
    T13 = a + CONST_0;
    T11 = a + a;
    outp = T11;
    goto S3;
S3:
    outp = T13;
    goto S3;

113 114

The C-Like Intermediate Format (CLIF) is the textual representation of a
Function Flow Graph (FFG). The TEST and ASSIGN operations have the above
C-like syntax. An unordered list of TEST and ASSIGN operations is associated
with each node in the FFG.

The I/O semantics of CLIF is shown here. There are three states S1, S2, and
S3, with transitions from S1 to S2, from S2 to S3, and a loop from S3 to S3.
inp is an input and outp is an output.
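To make the I/O semantics concrete, here is a minimal interpreter for the fragment above. The dictionary-based encoding, the step bound, and the reading of the partially garbled assignment as `T13 = a + CONST_0` are assumptions for illustration:

```python
# Tiny interpreter for the CLIF example: S1 jumps to S2; S2 reads inp and
# emits T11 = inp + inp; S3 then emits T13 forever (the self-loop).

def run_clif(inp, max_steps=4):
    env = {"a": 0, "CONST_0": 0, "T11": 0, "T13": 0}
    state, outputs = "S1", []
    for _ in range(max_steps):          # bound the S3 self-loop
        if state == "S1":
            state = "S2"
        elif state == "S2":
            env["a"] = inp
            env["T13"] = env["a"] + env["CONST_0"]
            env["T11"] = env["a"] + env["a"]
            outputs.append(env["T11"])  # outp = T11
            state = "S3"
        elif state == "S3":
            outputs.append(env["T13"])  # outp = T13, then loop on S3
    return outputs

print(run_clif(5))   # one T11 emission, then T13 repeatedly: [10, 5, 5]
```

The trace makes the "preserved" I/O behavior visible: whatever optimization is applied to the unordered operation lists inside S2 must still produce the same output sequence on outp.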
FFG / CLIF Example

[Figure: an FFG node containing x = x + y (with y = 1) and two outgoing
transitions, (cond2 == 1) / output(b) and (cond2 == 0) / output(a). The
legend marks constants, output flow, and dead operations.]

CLIF textual representation:

S1:   x = x + y;
      a = b + c;
      a = x;
      cond1 = (y == cst1);
      cond2 = !cond1;
      if (cond2) goto S1L0;
      output = a;
      goto S1;  /* Loop */
S1L0: output = b;
      goto S1;

Tree-Form FFG

[Figure: the same FFG drawn in tree form, where clusters of nodes form a
state. S# = State, S#L# = Label in State S#.]

115 116

S# = state; S#L# = labels (branches) within a state. An FFG and its
corresponding CLIF representation are shown here for a simple example. It is
a rather straightforward mapping.

This is the tree form of the FFG, where clusters of nodes form a state. This
gives a hierarchical structure to the FFG.
Function/Architecture Codesign

Function/Architecture Optimizations

FAC Optimizations

• Architecture-independent phase
  – The task function is considered solely, and control/data flow analysis
    is performed
  – Removes redundant information and computations
• Architecture-dependent phase
  – Relies on architectural information to perform additional guided
    optimizations tuned to the target platform

117 118

In this part of the lecture, we will focus on FAC optimizations that can
reduce the complexity of the FFG and AFFG models so that the hardware and
software can be successfully synthesized.

In FAC, optimizations are classified into:
(1) architecture-independent optimizations, which focus on control and data
flow analysis to remove redundant information and computation, and
(2) architecture-dependent optimizations, which perform optimizations
related to the target platform.
Function Optimization

• Architecture-independent optimization objective:
  – Eliminate redundant information in the FFG.
  – Represent the information in an optimized FFG that has a minimal number
    of nodes and associated operations.

FFG Optimization Algorithm

FFG_optimization_algorithm(G)
begin
    Build FFG
    while changes to FFG do
        Variable Definition and Uses
        Reachability Analysis
        Normalization
        Available Elimination
        False Branch Pruning
        Copy Propagation
        Dead Operation Elimination
    end while
end

119 120

Architecture-independent optimizations are applied to the FFG to eliminate
redundancies, so that it has a smaller number of nodes and associated
operations.

This is the FFG optimization algorithm. Most of these techniques come from
classical software engineering; refer to an SE textbook for details on these
functional optimization techniques. Some of them are introduced briefly as
follows.
(1) Variable Definition and Uses: based on the "definition" and "use" of
variables, we can avoid considering the value of some variables during some
parts of a computation; we can also propagate the value of variables so that
redundant information is eliminated.
(2) Reachability Analysis: a reachability graph is constructed for an FFG,
and unreachable states can thus be eliminated, making the FFG more compact
and smaller in size.
(3) Normalization: expressions are rewritten into a canonical form so that
equivalent computations can be recognized and shared.
(4) Available Elimination (AE): AE is a technique used to collect all the
expressions that have been computed at each node and can be propagated to
successor nodes. This technique prevents repeated computation of the same
expression. Details on this technique will be illustrated in later slides.
(5) False Branch Pruning: some branches of a choice (switch-case) will never
be taken in some contexts and can thus be eliminated.
(6) Copy Propagation: a copy of variable and expression values can be
propagated to eliminate redundant computations.
(7) Dead Operation Elimination: operations which are never executed can be
eliminated.
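Two of the passes listed in the note, copy propagation and dead operation elimination, can be sketched over a straight-line list of ASSIGN operations. The `(dest, sources)` encoding and the `live_out` parameter are illustrative assumptions:

```python
# Sketch of copy propagation followed by dead-operation elimination over a
# straight-line list of (dest, [source, ...]) ASSIGN operations.

def copy_propagate(ops):
    """Replace uses of variables that are plain copies of another variable."""
    copies, out = {}, []
    for dest, srcs in ops:
        srcs = [copies.get(s, s) for s in srcs]          # rewrite uses
        # any copy relation involving dest is now stale
        copies = {d: s for d, s in copies.items() if dest not in (d, s)}
        if len(srcs) == 1 and srcs[0].isidentifier() and srcs[0] != dest:
            copies[dest] = srcs[0]                       # dest is a pure copy
        out.append((dest, srcs))
    return out

def eliminate_dead(ops, live_out):
    """Drop ASSIGNs whose destination is never used afterwards."""
    live, kept = set(live_out), []
    for dest, srcs in reversed(ops):                     # backward pass
        if dest in live:
            kept.append((dest, srcs))
            live.discard(dest)
            live.update(s for s in srcs if s.isidentifier())
        # otherwise the operation is dead and dropped
    return list(reversed(kept))

# a = x; b = a; output = b + y   becomes   output = x + y
ops = [("a", ["x"]), ("b", ["a"]), ("output", ["b", "y"])]
opt = eliminate_dead(copy_propagate(ops), live_out={"output"})
print(opt)   # [('output', ['x', 'y'])]
```

Note how the two passes interact: copy propagation by itself removes nothing, but it makes the copies `a` and `b` dead, so the following elimination pass can drop them. This is the same style of pass ordering the FFG algorithm's `while changes` loop exploits.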
Optimization Approach

• Develop an optimizer for the FFG (CLIF) intermediate design representation
• Goal: optimize for speed and size by reducing
  – ASSIGN operations
  – TEST operations
  – variables
• Reach the goal by solving a sequence of data flow problems for analysis
  and information gathering, using an underlying Data Flow Analysis (DFA)
  framework
• Optimize by information redundancy elimination

Sample DFA Problem: Available Expressions Example

• Goal is to eliminate re-computations
  – Formulate the Available Expressions problem
  – Forward Flow (meet) Problem

[Figure: S1 computes t := a + 1 and S2 computes t1 := a + 1 and t2 := b + 2,
both starting from AE = ∅; after S1, AE = {a+1}, and after S2,
AE = {a+1, b+2}. S3, with predecessors S1 and S2, receives AE = {a+1},
performs a := a * 5 and t3 = a + 2, and propagates AE = {a+2}.
AE = Available Expressions.]

121 122

The two other main solution methods (besides the iterative one) are:
• Elimination: analogous to Gaussian elimination; works on a restricted
  class of flow graphs (Graham-Wegman)
• Path algebra (Tarjan)
The 3 properties shown are intended to address: 1) feasibility,
2) optimality, and 3) speed of convergence to the optimal solution.
Optimality really comes from "restrictions" on the input language (mainly no
aliasing, etc.), as well as from properties of the data flow problems solved
(monotone, distributive, …). Kildall developed data propagation algorithms
in a general lattice-theoretic framework ("Global expression optimization
during compilation", Proc. ACM Conf. on PoPL, Oct. 1973). Kildall was the
first to formulate data flow problems in a framework that combines flow
graph structure with lattice properties (a semilattice with meet or join
operators). The framework is a 4-tuple: (flow graph, semilattice, class of
functions, choice of a particular defining equation (assignment map M)).

This is a local version of common sub-expression elimination. The idea is to
eliminate re-computations; an example is given in the slide. Available
expressions (AE) are generated on-the-fly for each node and then reused if
required to eliminate re-computations of similar expressions. In the
example, AE is initially empty for S1 and S2 since no computations are done
before those two nodes. After S1 we have AE = {a+1}, and after S2 we have
AE = {a+1, b+2}. The intersection of those two sets, namely {a+1}, can be
passed to S3 as the propagated AE. However, since S3 destroys the value of
a+1 by changing the value of variable a, a+1 is not propagated beyond S3.
There is a new computation, namely a+2, which can be propagated to the nodes
succeeding S3.
Data Flow Problem Instance

• A particular (problem) instance of a monotone data flow analysis framework
  is a pair I = (G, M), where M: N → F is a function that maps each node N
  in V of FFG G to a function in F on the node-label semilattice L of the
  framework D.

Data Flow Analysis Framework

• A monotone data flow analysis framework D = (L, ∧, F) is used to
  manipulate the data flow information by interpreting the node labels on N
  in V of the FFG G as elements of an algebraic structure, where
  – L is a bounded semilattice with meet ∧, and
  – F is a monotone function space associated with L.

123 124

This is the formal definition of an instance of a monotone data flow
analysis framework, which gives a mapping from an FFG to a semilattice. This
framework can be used to perform data flow analysis, and optimizations on
the FFG based on the analysis results.

This is the more general definition of a data flow analysis framework. An
algebraic structure called the semilattice is used to analyze data in the
framework. This is the theory behind FFG optimization. (Students may simply
take away that it rests on solid theory rather than ad hoc analysis.)
Solving Data Flow Problems

Data flow equations:

In(S3) = ∩_{P ∈ {S1, S2}} Out(P)
Out(S3) = (In(S3) − Kill(S3)) ∪ Gen(S3)

[Figure: the Available Expressions example again. S1 (t := a + 1) and S2
(t1 := a + 1; t2 := b + 2) start from AE = ∅ and produce AE = {a+1} and
AE = {a+1, b+2}; S3 (a := a * 5; t3 = a + 2) receives AE = {a+1} and
propagates AE = {a+2}. AE = Available Expressions.]

Solving Data Flow Problems

• Solve data flow problems using the iterative method
  – General: does not depend on the flow graph
  – Optimal for a class of data flow problems
  – Reaches a fixpoint in polynomial time (O(n²))

125 126

The data flow equations given for the example in the slide show how a new AE
set is computed from the propagated AE and the newly performed computations.
S3 gets the intersection of the AE from S1 and S2, labeled In(S3). S3
propagates only those expressions that it has received (i.e., In(S3)) and
has not destroyed (i.e., In(S3) − Kill(S3)), plus those newly computed by S3
(i.e., Gen(S3)). Thus, the AE propagated by S3 is Out(S3) = {a+2}.

An iterative method is used for solving data flow problems such that
fixpoints can be reached in polynomial time (O(n²)).
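The iterative method applied to the equations above can be sketched directly: initialize optimistically, then re-evaluate In(n) = ∩ Out(pred) and Out(n) = (In(n) − Kill(n)) ∪ Gen(n) until nothing changes. The graph encoding and the gen/kill sets for the S1/S2/S3 example are illustrative assumptions:

```python
# Iterative available-expressions (AE) analysis on the slide's three-node
# example (S1, S2 -> S3), using the In/Out equations shown above.

def available_expressions(nodes, preds, gen, kill, all_exprs):
    """Iterate until the fixpoint of the AE data flow equations."""
    out = {n: set(all_exprs) for n in nodes}   # optimistic initialization
    inn = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            p = preds[n]
            # In(n): intersection (meet) of predecessors' Out; empty if none
            inn[n] = set.intersection(*(out[q] for q in p)) if p else set()
            new_out = (inn[n] - kill[n]) | gen[n]
            if new_out != out[n]:
                out[n] = new_out
                changed = True
    return inn, out

# S1 computes a+1; S2 computes a+1 and b+2; S3 redefines a (killing every
# expression over a) and then computes a+2 with the new value of a.
nodes = ["S1", "S2", "S3"]
preds = {"S1": [], "S2": [], "S3": ["S1", "S2"]}
gen   = {"S1": {"a+1"}, "S2": {"a+1", "b+2"}, "S3": {"a+2"}}
kill  = {"S1": set(), "S2": set(), "S3": {"a+1", "a+2"}}
exprs = {"a+1", "b+2", "a+2"}

inn, out = available_expressions(nodes, preds, gen, kill, exprs)
print(inn["S3"], out["S3"])   # In(S3) = {'a+1'}, Out(S3) = {'a+2'}
```

On this small DAG the fixpoint is reached after one full sweep; the second sweep merely confirms that nothing changes, matching the O(n²) bound mentioned in the note.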
FFG Optimization Algorithm

• Solve the following problems in order to improve the design:
  – Reaching Definitions and Uses
  – Normalization
  – Available Expression Computation
  – Copy Propagation and Constant Folding
  – Reachability Analysis
  – False Branch Pruning
• Code improvement techniques
  – Dead Operation Elimination
  – Computation sharing through normalization

Function/Architecture Codesign

127 128

This is the list of FFG optimization algorithms (refer to SE textbooks for
details, or to the notes of slide 32, where they were briefly explained).

A diagrammatic representation of FAC. After decomposition, intermediate
formats such as the FFG and AFFG are used to represent the function and the
architecture-attributed function of a system. There is a tradeoff between
function and architecture, such that either the function has to be degraded
to meet architectural constraints or the architecture upgraded to meet
functional requirements. An example on the ATM VPN will illustrate this at
the end of this set of slides.
Function Architecture Optimizations

• Function/Architecture Representation:
  – The Attributed Function Flow Graph (AFFG) is used to represent
    architectural constraints impressed upon the functional behavior of an
    EFSM task.

Architecture Dependent Optimizations

[Figure: EFSM → FFG → OFFG (architecture-independent optimization) → AFFG →
CDFG, with architectural information from a library feeding the
architecture-dependent steps.]

129 130

AFFG = FFG + architectural constraints. The AFFG is the attributed version
of the FFG after it has been mapped to some hardware-software system
architecture, such as an AMBA-based SoC architecture. The AFFG is
architecture-platform dependent and thus has more information for further
optimizations based on the features or restrictions of the architecture.

This shows the transformations between one model and its optimized or
attributed forms. The EFSM is first converted into an FFG (or its CLIF
textual format), which is optimized into an OFFG. The architectural library
is consulted to attribute the FFG into an AFFG, which is optimized and
finally converted into the CDFG for hardware-software implementation. In
POLIS, the final CDFG is the SHIFT model. Details can be found in later
slides.
EFSM in AFFG (State Tree) Form

[Figure: a state tree with states S0, S1, and S2 whose nodes point to
functions F0 through F8.]

Architecture Dependent Optimization Objective

• Optimize the AFFG task representation for speed of execution and size,
  given a set of architectural constraints
• Size: area of hardware, code size of software

131 132

This is the tree form of the AFFG, which corresponds to the tree form of the
FFG, giving a hierarchical structure to the graph.

The AFFG can be further optimized for speed of execution and size. For
hardware, it is the area that is reduced; for software, it is the code size,
which represents the amount of memory required and thus the cost of the
system.
Motivating Example

[Figure: a reactivity loop through CDFG nodes 1 to 10 in which the
expression a + b is recomputed in several nodes (y = a + b, x = a + b,
z = a + b) even though a = c only changes a on some paths. The goal is to
eliminate the redundant, needless runtime re-evaluation of the a + b
operation.]

Cost-guided Relaxed Operation Motion (ROM)

• For performing safe operation motion from heavily executed portions of a
  design task to less visited segments
• Relaxed Operation Motion (ROM):

begin
    Data Flow and Control Optimization
    Reverse Sweep (dead operation addition, normalization and available
        operation elimination, dead operation elimination)
    Forward Sweep (optional, minimize the lifetime)
    Final Optimization Pass
end

133 134

This is a motivating example for the optimization of an AFFG. As you can see
from the example, there are many "a+b" computations which are needlessly
repeated. The redundant computations may be eliminated by analyzing the data
flow, for example using the AE technique.

ROM is the optimization procedure for the AFFG. The algorithm consists of
the three steps shown in the slide.
(1) Reverse Sweep tries to optimize the AFFG by adding dead operations so
that available expressions can be reused; dead operations are then also
eliminated.
(2) Forward Sweep is optional and tries to minimize the lifetime of a
variable by moving the "definition" of a variable as close to its "use" as
possible. A reduced lifetime represents more efficient use of memory, by
reducing the time the value of a variable needs to be stored for later use.
(3) Final Optimization Pass repeats the above steps as long as changes are
made to the AFFG.
Cost-Guided Operation Motion

[Figure: the design's FFG is attributed using user input, profiling
(back-end), system constraints, and system specs; an inference engine
provides cost estimation to guide Relaxed Operation Motion over the
attributed FFG.]

Function Architecture Co-design in the Micro-Architecture

[Figure: a small task (t1 = 3*b; t2 = t1 + a; emit x(t2)) is decomposed and
optimized for the target processors and ASICs; instruction selection and
operator strength reduction are the micro-architectural optimizations.]

135 136

Cost estimation is required for ROM optimizations and is performed by
profiling through an inference engine. For example, the code size and
variable lifetimes are estimated only through profiling.

Micro-architecture optimizations result in an optimized AFFG. For example,
instruction selection and operator strength reduction can be performed for
the function and the architecture, respectively, to optimize the AFFG.
Operator Strength Reduction

Before:            After:
t1 = 3*b           expr1 = b + b;
t2 = t1 + a        t1 = expr1 + b;
x = t2             t2 = t1 + a;
                   x = t2;

Reducing the multiplication operator

Architectural Optimization

• Abstract Target Platform
  – Macro-architectures of the HW or SW system design tasks
• CFSM (Co-design FSM): an FSM with reactive behavior
  – A reactive block
  – A set of combinational data-flow functions
• Software Hardware Intermediate Format (SHIFT)
  – SHIFT = CFSMs + Functions

137 138

This shows how multiplication operators can be reduced to the weaker and
cheaper addition operators. In the example, the expensive multiplication by
3 is replaced by two cheap additions (b + b, then + b), so the whole
computation uses three additions in total.

An abstract target platform is used to represent the macro-architectures of
hardware and software design tasks. A CFSM (Co-design FSM) is an FSM with
reactive behavior, that is, a reactive block with a set of combinational
dataflow functions. A CFSM is the basic mapping unit for hardware and
software tasks. SHIFT (Software Hardware Intermediate Format) is a network
of CFSMs along with functions. SHIFT is used to represent the architecture
of a system.
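The strength reduction shown on the slide can be expressed as a small rewrite that turns a multiplication by a small constant k into k − 1 chained additions. The temporary naming scheme is an illustrative assumption, and a real flow would first compare the estimated costs of the two forms:

```python
# Rewrite dest = k * var into a chain of k - 1 additions, mirroring the
# slide's 3*b -> (b + b) + b transformation.

def reduce_strength(dest, k, var):
    """Return a list of (result, operand1, operand2) addition operations."""
    assert k >= 2
    # intermediate temporaries first, final result written to dest
    names = [f"expr{i}" for i in range(1, k - 1)] + [dest]
    ops, acc = [], var
    for name in names:
        ops.append((name, acc, var))    # name = acc + var
        acc = name
    return ops

print(reduce_strength("t1", 3, "b"))
# [('expr1', 'b', 'b'), ('t1', 'expr1', 'b')]  -- matches the slide
```

Whether this pays off depends on the target: on a core with a slow or absent multiplier the additions win, while on one with a single-cycle multiplier the original form is better, which is exactly why the optimization is architecture-dependent.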
Macro-Architectural Organization

[Figure: a SW partition (RTOS on a processor) and HW partitions (HW1 to HW5)
connected through interfaces over a bus.]

Architectural Organization of a Single CFSM Task

[Figure: the internal organization of one CFSM task.]

139 140

Macro-architectural organization of the mapped architecture. It is assumed
that there is only ONE software partition, which includes the RTOS running
on a microprocessor. The hardware can be partitioned into more than one
partition. This assumption is valid for a single-processor system, which has
only one piece of software running. The hardware and software partitions
communicate through a bus interface.

This shows the architectural organization of a single CFSM task. It is
basically a CDFG in a state.
Task Level Control and Data Flow Organization

[Figure: a reactive controller driving a datapath. An EQ comparator on
inputs a and b produces a_EQ_b; the INC_a and RESET_a controls select,
through a multiplexer, between an incrementer (INC a) and a reset value for
the next value of a; the controller emits output y.]

CFSM Network Architecture

• Software Hardware Intermediate FormaT (SHIFT) for describing a network of
  CFSMs
• It is a hierarchical netlist of
  – Co-design finite state machines
  – Functions: state-less arithmetic, Boolean, or user-defined operations

141 142

This shows the hardware implementation of a CFSM. The reactive controller is
used to control the outputs based on the inputs. This hardware
implementation corresponds to the CFSM in the previous slide.

SHIFT = CFSMs + functions. SHIFT is a hierarchical netlist of CFSMs with
functions that are state-less operations, just like combinational circuits.
SHIFT: CFSMs + Functions

[Figure: an example SHIFT netlist with two hardware partitions and one
software partition; see the notes below.]

Architectural Modeling

• Using an AUXiliary specification (AUX)
• AUX can describe the following information
  – Signal and variable type-related information
  – Definition of the values of constants
  – Creation of a hierarchical netlist, instantiating and interconnecting
    the CFSMs described in SHIFT

143 144

An example of SHIFT: 2 HW partitions and 1 SW partition. FAC assumes there
is only 1 final SW partition, corresponding to a single-processor system;
FAC can also be extended to multiprocessor systems. In this example, CFSM1
and CFSM2 are in hardware partition 1 and CFSM3 is in hardware partition 2.
The results of CFSM1 can be propagated to CFSM3 through an e1 channel.
Similarly, CFSM4, CFSM5, CFSM6, and CFSM7 are all in a single software
partition. The hardware and software communicate through two channels, e2
and e3. Ports are used to abstract the different communication interfaces,
either among partitions or among CFSMs within a single partition. For
example, port3 is a bi-directional port for communication between the
hardware CFSM2 and the software CFSM7.

Auxiliary specifications include other details of a SHIFT architecture, such
as variable types and constant values. Auxiliary specifications also specify
the hierarchical netlist among the CFSMs in a SHIFT architecture.
Mapping AFFG onto SHIFT

• Synthesis through mapping the AFFG onto SHIFT and AUX (Auxiliary
  Specification)
• Decompose each AFFG task behavior into a single reactive control part and
  a set of data-path functions

Mapping AFFG onto SHIFT Algorithm (G, AUX)
begin
    foreach state s belonging to G do
        build_trel(s.trel, s, s.start_node, G, AUX);
    end foreach
end

Architecture Dependent Optimizations

• Additional architecture information leads to an increased level of macro-
  (or micro-) architectural optimization
• Examples of macro-architectural optimization
  – Multiplexing computation inputs
  – Function sharing
• Example of micro-architectural optimization
  – Data type optimization

145 146

This shows how an AFFG is mapped onto SHIFT and AUX. Each AFFG task is
decomposed into a single reactive control part and a set of data-path
functions. The transition relation (trel) is built for each state s in the
FFG G.

Architecture-dependent optimizations include macro- and micro-level ones.
Examples of macro-architecture optimization include multiplexing computation
inputs and sharing functions; details are in later slides. An example of
micro-architecture optimization is data type optimization.
Distributing the Reactive Controller

Move some of the control into the data path as an ITE assign expression

[Figure: a reactive controller with inputs c, d, e, s, a, and b; part of the
control logic is re-implemented as ITE (if-then-else) blocks in the data
path, simplifying the controller.]

ITE: if-then-else

Multiplexing Inputs

[Figure: the computation T = b + c must also serve c = a + T(b+a); a 2x1
multiplexer controlled by the reactive controller selects between inputs a
and c, so a single adder computes both T(b+a) and T(b+c).]

147 148

This shows how a complex reactive controller can be simplified by moving
some control into the data path and using a multiplexer. The diagram in the
slide shows how some control can be implemented as ITE (if-then-else) in the
data path, so that the reactive controller becomes simpler to implement.
This is a type of macro-architecture optimization.

This shows how inputs can be multiplexed so that the reactive controller can
be simpler. In the example above, c may take different values when T = b + c
is computed. To account for the different values, a 2x1 multiplexer can be
used for the input c. In this way, only one adder is required instead of the
original two adders.
Micro-Architectural Optimization

• Available Expressions alone cannot eliminate T2
• But if variables are registered (additional architectural information), we
  can share T1 and T2

[Figure: S1 performs T1 = a + b; x = T1; a = c. S2 performs T2 = a + b;
Out = T(a+b); emit(Out). With registered variables, a single shared adder
T(a+b) computes both T1 and T2.]

Function/Architecture Codesign

Hardware/Software Cosynthesis and Estimation

149 150

Micro-architectural information allows better optimizations. In the above
example, if variables are stored in registers, then only one adder, instead
of two, is necessary to implement the above FFG. This is a type of
micro-architecture optimization.

This part introduces the cosynthesis and estimation techniques in FAC.
Co-Synthesis Flow

[Figure: EFSM → CDFG → SHIFT (CFSM network) → software compilation and
hardware synthesis, yielding a SW partition (RTOS on a processor) and HW
partitions (HW1 to HW5) over a bus. An FFG interpreter supports simulation;
object code (.o) and a netlist are the final outputs.]

Concrete FA Codesign Flow

[Figure: graphical EFSM, Esterel, and reactive VHDL specifications, together
with constraints, are decomposed into EFSMs/FFGs; functional optimization
and behavioral SHIFT modeling with AUX are followed by cost-guided
macro-level and micro-level optimization against a resource pool, estimation
and validation, and finally RTOS/interface and hardware/software
cosynthesis.]

151 152

This shows the overall cosynthesis flow, consisting of all model
transformations and optimizations down to the final SHIFT architecture,
which can be implemented in hardware and software. This flow is integrated
in the POLIS tool. Software can be compiled into object code, and hardware
can be synthesized into a netlist.

A FAC codesign flow is shown in the slide. Graphical EFSMs, Esterel language
specifications, and reactive VHDL are the inputs to FAC, along with
architectural constraints. After functional decomposition, a set of FFGs
(function flow graphs) is generated; these then undergo functional
optimization under the SHIFT architectural constraints, resulting in
Attributed FFGs (AFFGs). Macro-level and micro-level optimizations are then
applied to the AFFGs, and finally the full system architecture is
synthesized.
POLIS Co-design Environment

[Figure: graphical EFSM and ESTEREL specifications pass through compilers
into CFSMs; SW synthesis/estimation and HW synthesis/estimation feed
partitioning. The outputs are SW code + RTOS and a logic netlist, which
undergo performance/trade-off evaluation and physical prototyping on a
programmable board (a μP of choice, FPGAs, FPICs).]

POLIS Co-design Environment

• Specification: FSM-based languages (Esterel, …)
• Internal representation: CFSM network
• Validation:
  – High-level co-simulation
  – FSM-based formal verification
  – Rapid prototyping
• Partitioning: based on co-simulation estimates
• Scheduling
• Synthesis:
  – S-graph (based on a CDFG) based code synthesis for software
  – Logic synthesis for hardware
• Main emphasis on unbiased verifiable specification

153 154

This is the POLIS codesign flow that implements FAC. The CFSM is the core
model used for hardware-software cosynthesis. For software, the S-graph is
used for profiling and synthesis.

Some essential phases and parts of the POLIS codesign environment are shown
here. Software synthesis is based on S-graphs (software graphs), which can
be used to automatically generate software code. The synchronous Esterel
language is used for system requirement input, which can be formally
validated. The internal representation is a CFSM network. Validation can be
performed in three ways: high-level cosimulation, formal verification, and
rapid prototyping. Partitioning and scheduling are performed on the CFSM
network.
Hardware/Software Co-Synthesis

• Functional GALS CFSM model for hardware and software
  – Initially unbounded delays, refined after architecture mapping
• Automatic synthesis of:
  – Hardware
  – Software
  – Interfaces
  – RTOS

RTOS Synthesis and Evaluation in POLIS

[Figure: the CFSM network and a resource pool feed HW/SW synthesis and RTOS
synthesis, followed by physical prototyping. The synthesized RTOS
1. provides communication mechanisms among the CFSMs implemented in SW and
   between the OS and the HW partitions, and
2. schedules the execution of the SW tasks.]

155 156

Hardware-software cosynthesis in FAC/POLIS builds on the functional GALS
(globally asynchronous, locally synchronous) CFSM model for hardware and
software implementations, and automatically synthesizes the hardware,
software, interfaces, and RTOS.

This shows the flow of RTOS synthesis and evaluation in POLIS. The S-graph
can be used to provide communication mechanisms among the CFSMs implemented
in software and between the OS and the hardware partitions. It can also be
used to schedule the execution of the software tasks.
Estimation for CDFG Synthesis Estimation-Based Co-simulation

BEGIN
Network
Networkof
of
EFSMs
EFSMs
40
detect(c) a<b
T
41 63
26 F F T
HW/SW
HW/SWPartitioning
Partitioning
a := 0 a := a + 1 SW
SWEstimation
Estimation HW
HWEstimation
Estimation
9
18

emit(y)
HW/SW
HW/SWCo-Simulation
Co-Simulation
14 Performance/trade-off
END Performance/trade-off Evaluation
Evaluation

Costs in clock cycles on a 68HC11 target processor


using the Introl compiler for a simple CFSM
157 158

This shows how software costs are estimated for a CDFG in FAC/POLIS. The costs are in terms of clock cycles on a 68HC11 target processor using the Introl compiler for a simple CFSM. The different weights (numbers of clock cycles) on the edges leaving a branch node represent the different numbers of clock cycles required for the different branch cases. The CDFG can then be used to schedule/synthesize the software.

Cosimulation requires cost and performance estimations for both software and hardware. Hardware-software cosimulation evaluates the performance of a hardware-software partition and suggests some trade-offs for improving the performance.
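The branch weights described above feed a simple path-cost analysis. A minimal sketch (the cycle numbers used in the test are illustrative, not read off the figure):

```c
#include <assert.h>

/* Worst-case cost (in clock cycles) of a two-way CDFG branch:
 * the test's cost plus the more expensive of the two branch arms.
 * Real estimation walks the whole CDFG this way, edge by edge. */
int branch_worst_case(int test_cost, int then_cost, int else_cost)
{
    return test_cost + (then_cost > else_cost ? then_cost : else_cost);
}
```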

Co-simulation Approach
z Fills the “validation gap” between fast and slow models
– Performs performance simulation based on software and hardware timing estimates
z Outputs behavioral VHDL code
– Generated from CDFG describing EFSM reactive function
– Annotated with clock cycles required on target processors
z Can incorporate VHDL models of pre-existing components

Co-simulation Approach
z Models of mixed hardware, software, RTOS and interfaces
z Mimics the RTOS I/O monitoring and scheduling
– Hardware CFSMs are concurrent
– Only one software CFSM can be active at a time
z Future Work
– Architectural view instead of component view

159 160

Hardware logic simulation models are intrinsically slow, while software execution models such as an ISS (instruction set simulator) can be very fast. Models at different abstraction levels (such as FIFO channels and buffer implementations) also simulate at different speeds. There is a large gap between how far the models can execute. Hardware-software cosimulation tries to fill this gap through memory models and various kinds of optimizations.

Simulation is done based on software and hardware timing estimates and not the real hardware or software modules themselves.

In FAC/POLIS, behavioral VHDL code is generated from the CDFG describing the EFSM reactive function and then annotated with the clock cycles required on target processors. The methodology can also reuse VHDL models of existing components.

This cosimulation approach consists of mixed modules in hardware, software, RTOS, and interfaces. Models are often at different abstraction levels, which increases the complexity of cosimulation and also necessitates the use of wrappers for interconnecting components at different abstraction levels.

Coverification Tools
z Mentor Graphics Seamless Co-Verification Environment (CVE)

Seamless CVE Performance Analysis
(Screenshot of the CVE profiling views.)

161 162

This is a well-known and widely used cosimulation tool from Mentor Graphics called Seamless CVE (Co-Verification Environment). CVE supports the integration and concurrent execution of one or more processor ISSs (instruction set simulators) with a hardware logic simulator. CVE v5.0 also supports the execution of C models through the C Bridge. Software is simulated by supporting different processor instruction set simulators in different packages such as ARM 920T and Palm (supported by CiC now). There are other PSPs (processor support packages) not currently available in CiC that must be purchased separately. As far as the logic simulator is concerned, Mentor Graphics’ ModelSim is a popular one supported by CVE. CVE uses a coherent memory server to optimize the execution speeds of the differently rated ISS and logic simulator.

CVE v5.0 can display three types of profiling information during cosimulation: the bus loading, the code profile (which function took what time), and memory transactions (to see where the major bottleneck in hw/sw communication is). After observing this information, a user can try to improve the design for better performance, for example by removing the bottleneck in communication.

References
z Function/Architecture Optimization and Co-Design for Embedded Systems, Bassam Tabbara, Abdallah Tabbara, and Alberto Sangiovanni-Vincentelli, Kluwer Academic Publishers, 2000
z The Co-Design of Embedded Systems: A Unified Hardware/Software Representation, Sanjaya Kumar, James H. Aylor, Barry W. Johnson, and Wm. A. Wulf, Kluwer Academic Publishers, 1996
z Specification and Design of Embedded Systems, Daniel D. Gajski, Frank Vahid, S. Narayan, and J. Gong, Prentice Hall, 1994

Codesign Case Studies
z ATM Virtual Private Network
– CSELT, Italy
– Virtual IP library reuse
z Digital Camera SoC
– Chapter 7, F. Vahid, T. Givargis, Embedded System Design
– Simple camera for image capture, storage, and download
– 4 design implementations

163 164

These are the references for the materials in this set of slides for codesign. The FAC methodology is from the first reference book. Gajski’s book is mainly on high-level synthesis, but it is a very good reference book for all sorts of high-level design algorithms and methods.

Two case studies will be discussed for hw/sw codesign: (1) an ATM (Asynchronous Transfer Mode) VPN (Virtual Private Network) node, and (2) a digital camera SoC. The ATM example was developed by CSELT from Italy, and the digital camera is an example from Chapter 7 of Frank Vahid’s book on Embedded System Design. These two examples illustrate how hardware-software codesign must be practiced. The ATM example shows how POLIS is integrated with the reuse of IP from the VIP library. The camera example shows how different hardware-software partitions can be explored and finally how an SoC can be designed to meet all user-given requirements.

CODES’2001 paper
INTELLECTUAL PROPERTY RE-USE IN EMBEDDED SYSTEM CO-DESIGN: AN INDUSTRIAL CASE STUDY
E. Filippi, L. Lavagno, L. Licciardi, A. Montanaro, M. Paolini, R. Passerone, M. Sgroi, A. Sangiovanni-Vincentelli
Centro Studi e Laboratori Telecomunicazioni S.p.A., Italy
Dipartimento di Elettronica - Politecnico di Torino, Italy
EECS Dept. - University of California, Berkeley, USA

ATM EXAMPLE OUTLINE
INTRODUCTION
THE POLIS CO-DESIGN METHODOLOGY
IP INTEGRATION INTO THE CO-DESIGN FLOW
THE TARGET DESIGN: THE ATM VIRTUAL PRIVATE NETWORK SERVER
RESULTS
CONCLUSIONS


165
166

This case study is about the design of an ATM (asynchronous transfer mode) network VPN (virtual private network) node. It was developed by CSELT from Italy in collaboration with the University of California, Berkeley, USA. Mainly, we will introduce POLIS, how the VIP library is used with POLIS to design the ATM node hardware-software, and how POLIS must be extended to deal with real system design.

Outline
. The key aspects in the architectural evolution of switching nodes and the technological key issues.
. An example of a switching architecture is given as the state of the art for an industrial product.
. Architectural and technological solutions to expand switching capacity using currently available technology.
. How optical technology is used to enhance the performance of packet switching nodes.
. Conclusions.

NEEDS FOR EMBEDDED SYSTEM DESIGN
CODESIGN METHODOLOGY AND TOOLS
– EASY DESIGN SPACE EXPLORATION
– EARLY DESIGN VALIDATION
INTELLECTUAL PROPERTY LIBRARY
– HIGH DESIGN PRODUCTIVITY
– HIGH DESIGN RELIABILITY

THE POLIS EMBEDDED SYSTEM CO-DESIGN ENVIRONMENT
HW-SW CO-DESIGN FOR CONTROL-DOMINATED REAL-TIME REACTIVE SYSTEMS
– AUTOMOTIVE ENGINE CONTROL, COMMUNICATION PROTOCOLS, APPLIANCES, ...
DESIGN METHODOLOGY
– FORMAL SPECIFICATION: ESTEREL, FSMS
– TRADE-OFF ANALYSIS, PROCESSOR SELECTION, DELAYED PARTITIONING
– VERIFY PROPERTIES OF THE DESIGN
– APPLY HW AND SW SYNTHESIS FOR FINAL IMPLEMENTATION
– MAP INTO FLEXIBLE EMULATION BOARD FOR EMBEDDED VERIFICATION

167 168

This case study shows why IPs may be reused in hw/sw codesign for high design productivity and reliability.

This is a brief description of the POLIS codesign environment (as we discussed earlier in FAC). The design methodology is summarized here for introducing this example.

THE POLIS CODESIGN FLOW
(Flow diagram: GRAPHICAL EFSM / ESTEREL / … → COMPILERS → CFSMs, with FORMAL VERIFICATION and PARTITIONING on the CFSM network; HW SYNTHESIS and SW SYNTHESIS, with SW ESTIMATION and HW ESTIMATION feeding HW/SW CO-SIMULATION and PERFORMANCE / TRADE-OFF EVALUATION; outputs are RTL VHDL source code, a logic netlist, and SW code + RTOS code for PHYSICAL PROTOTYPING.)

VIP LIBRARY (TM)
LIBRARY OF CUSTOMIZABLE IP SOFT CORES
APPLICATION AREAS:
– FAST PACKET SWITCHING (ATM, TCP/IP)
– VIDEO AND MULTIMEDIA
– WIRELESS COMMUNICATION
(Example modules: SCRAMBLER/DESCRAMBLER, SDH frame sync, CELINE, ATMFIFO, DISCRETE COSINE TRANSFORM, VITERBI DECODER, CIDGEN, REED SOLOMON ENCODER/DECODER, LQM, SORTCORE, ARBITER, UTOPIA LEVEL 1/2, MPI, SRAM_INT, UPCO/DPCO, VERCOR, C2W, AC, MC68K, ATM-GEN.)
MODULES HAVE:
– MEDIUM ARCHITECTURAL COMPLEXITY
– HIGH RE-USABILITY
– HIGH PROGRAMMABILITY DEGREE
– REASONABLE PERFORMANCE

169 170

This is the POLIS codesign flow (as discussed earlier in FAC). This shows how the ATM VPN is going to be designed.

This is the VIP library that can be used for DSP applications. All IPs are available in RTL VHDL code. The IPs mainly cover fast packet switching, such as ATM and TCP/IP, and video and multimedia functions. The features of the VIP library modules are: medium architectural complexity, high reusability, high programmability, and reasonable performance.

GOALS
Î ASSESSMENT OF POLIS ON A TELECOM SYSTEM DESIGN
– CASE STUDY: ATM VIRTUAL PRIVATE NETWORK SERVER
Î INTEGRATION OF THE VIP LIBRARY (TM) IN THE POLIS DESIGN FLOW

CASE STUDY: AN ATM VIRTUAL PRIVATE NETWORK SERVER
(Block diagram of the A-VPN server: ATM IN FROM XC (155 Mbit/s) → CLASSIFIER → CONGESTION CONTROL (MSD) → WFQ SCHEDULER → ATM OUT TO XC (155 Mbit/s), with a SUPERVISOR block and DISCARDED CELLS leaving the congestion control.)

171 172

The goal of this case study is to show how an ATM VPN server can be designed using codesign technology and the VIP library.

This is the architecture of an ATM VPN server, which takes input from the ATM network at the speed of 155 Mbit/s and must output the cells at the same rate. In this ATM VPN server node, there is a congestion control algorithm called MSD (message selective discarding) which discards the cells of an entire message if the buffer queues are full; otherwise the cells are queued up in their respective channels. A WFQ (weighted fair queuing) scheduler is used to schedule and dispatch the cells out to the ATM network based on a real-time sorter.

CRITICAL DESIGN ISSUES
Î TIGHT TIMING CONSTRAINTS
FUNCTIONS TO BE PERFORMED WITHIN A CELL TIME SLOT (2.72 μs FOR A 155 Mbps FLOW) ARE:
– PROCESS ONE INPUT CELL
– PROCESS ONE OUTPUT CELL
– PERFORM MANAGEMENT TASKS (IF ANY)
Î FREQUENT ACCESS TO MEMORY TABLES THAT STORE ROUTING INFORMATION FOR EACH CONNECTION AND STATE INFORMATION FOR EACH QUEUE

DESIGN IMPLEMENTATION
– 7 VIP LIBRARY (TM) MODULES
– 2 COMMERCIAL MEMORIES
– SOME CUSTOM LOGIC (PROTOCOL TRANSLATORS)
– CONTROL UNIT: 25 CFSMs
(Block diagram: RX/TX interfaces at 155 Mbit/s, shared buffer memory, logic queue manager, internal address lookup algorithm, address lookup memory, supervisor, and VC/VP setup agent; the legend distinguishes VIP library modules, HW/SW codesign modules, and commercial memories.)
173 174

These are the design constraints. For a 155 Mbps flow, we need to process one cell within 2.72 microseconds, which is a very short time, so we need both hardware and software to handle all those data cells. There is frequent access to memory tables that store routing information for each connection and state information for each queue. Thus, memory access must be designed efficiently.

The different colors differentiate which modules were implemented using the VIP library, which modules were bought from the market, and which modules were codesigned. There are 7 VIP modules, 2 commercial memory modules, and 25 CFSMs that need to be implemented in hw and sw.
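The 2.72 μs budget follows directly from the ATM cell size. A one-line check, assuming the standard 53-byte cell and the 155.52 Mbit/s SDH line rate:

```c
#include <assert.h>

/* An ATM cell is 53 bytes = 424 bits; at 155.52 Mbit/s the cell
 * time slot is 424 / 155.52 = 2.726 us, the slide's "2.72 us". */
double cell_slot_us(double line_rate_mbps)
{
    return 53.0 * 8.0 / line_rate_mbps;  /* bits per cell / (Mbit/s) = us */
}
```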

ALGORITHM MODULE ARCHITECTURE / SUPERVISOR / BUFFER MANAGER
(Schematics: the MSD algorithm module sits between the RX interface and the logic queue manager (LQM) inside the A-VPN data path. Its blocks include the supervisor, the LQM interface with lqm_arbiter2, the collision_detector, msd_technique, and extract_cell2 blocks, the virtual clock scheduler (with state, quid, last, bandwidth, threshold, and full signals), the real-time sorter with arbiter_sc and arbiter_sorter, and the space_controller. Interface signals include QUEUE_STATUS, QUERY_QUID, PUSH_QUID, POP_QUID, FIFO_OCCUPATION, QUERY / ACCEPTED / RESSEND, PUSH / POP, ADD_DELN, UPDATE_TABLES, GRANT_ARBITER_SC_1/2, GRANT_/REQUEST_ARBITER_SORTER_2, COMPUTED_VIRTUAL_TIME_1/2, COMPUTED_OUTPUT_TIME, TOUT_FROM_SORTER, QUID_FROM_SORTER, READ_SORTER, POP_SORTER, and INSERT / READ / WRITE to the sorter.)
175 176

Within the full ATM VPN data flow, on the right we have the architecture of the MSD algorithm module, which is going to be codesigned. The supervisor will be in software.

This is the hardware schematic for an ATM VPN server. There are three arbiters, a real-time sorter, a supervisor, a collision detector, and the MSD technique.

IP INTEGRATION
(Diagram: a POLIS module exchanges events and values through an interface with a module-specific protocol. A HW protocol translator module connects POLIS (TM) HW codesign modules to a VIP (TM) module; a SW protocol translator / interface module connects POLIS (TM) SW codesign modules to a VIP (TM) module.)

DESIGN SPACE EXPLORATION
z CONTROL UNIT
– Source code: 1151 ESTEREL lines
– Target processor family: MIPS 3000 (RISC)
z FUNCTIONAL VERIFICATION
– Simulation (PTOLEMY)
z SW PERFORMANCE ESTIMATION
– Co-simulation (POLIS VHDL model generator)
z RESULTS
– 544 CPU clock cycles per time slot → 200 MHz clock frequency
– PROCESSOR FAMILY CHANGED (MOTOROLA PowerPC (TM))

MODULE | SIZE | I | II
MSD TECHNIQUE | 180 | SW | HW
CELL EXTRACTION | 85 | HW | SW
VIRTUAL CLOCK SCHEDULER | 95 | HW | HW
REAL TIME SORTER | 300 | HW | HW
ARBITER #1 | 33 | HW | HW
ARBITER #2 | 34 | SW | HW
ARBITER #3 | 37 | HW | SW
LQM INTERFACE | 75 | HW | HW
SUPERVISOR | 120 | SW | SW

177 178

This slide shows that we need to design interfaces between a VIP module and our codesigned module because different protocols are used by them. We also need to develop the sw/hw interface between a POLIS sw codesign module and a VIP library module. Mainly, protocol translators must be designed and tested by integration. High-level synthesis algorithms from Gajski’s book may be used here for protocol translator design. The hardware-software interface design method can also be looked up in Gajski’s book.

This slide shows why the processor choice was changed to the Motorola PowerPC: the MIPS 3000 RISC CPU could not give the desired performance. On the right, we have two different hardware/software partitions. The MSD technique, the cell extraction, arbiter #2, and arbiter #3 are implemented differently in the two partitions. The virtual clock scheduler, the real-time sorter, and arbiter #1 need high performance, hence they are all implemented in hardware in both partitions. The supervisor does not have real-time requirements, hence it is implemented in software in both partitions.
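The 200 MHz figure can be reproduced from the numbers on the slide:

```c
#include <assert.h>

/* Cycles per cell slot divided by the slot length gives the minimum
 * CPU clock: 544 cycles / 2.72 us = 200 MHz, the requirement that
 * ruled out the MIPS 3000 in this design space exploration. */
double min_clock_mhz(int cycles_per_slot, double slot_us)
{
    return cycles_per_slot / slot_us;  /* cycles per microsecond = MHz */
}
```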

DESIGN VALIDATION
z VHDL co-simulation of the complete design
– Co-design module code generated by POLIS
z Server code: ~ 14,000 lines
– VIP LIBRARY (TM) modules: ~ 7,000 lines
– HW/SW co-design modules: ~ 6,700 lines
– IP integration modules: ~ 300 lines
z Test bench code: ~ 2,000 lines
– ATM cell flow generation
– ATM cell flow analysis
– Co-design protocol adapters

CONTROL UNIT MAPPING RESULTS
MODULE | FFs | CLBs | I/Os | GATES
MSD TECHNIQUE | 66 | 106 | 114 | 1,600
CELL EXTRACTION | 26 | 35 | 66 | 564
VIRTUAL CLOCK SCHEDULER | 77 | 71 | 95 | 1,280
REAL TIME SORTER | 261 | 731 | 52 | 10,504
ARBITER #1 | 9 | 7 | 9 | 114
ARBITER #2 | 10 | 7 | 10 | 127
ARBITER #3 | 16 | 9 | 17 | 159
LQM INTERFACE | 20 | 39 | 27 | 603
PARTITION I | 409 | 892 | 120 | 13,224
PARTITION II | 443 | 961 | 256 | 14,228


179 180

This shows the size of the HDL code for both the design and the testbench. As we can see, reusing libraries saves us around half of the code to be written. But, of course, we still need to write the testbench (2,000 lines of HDL). The testbench consists mainly of ATM cell flow generation, analysis, and codesign protocol adapters. The codesign modules were automatically generated by POLIS.

This table shows the hardware features of the different modules found in the control path and the hardware sizes of the two partitions. The listed hardware features include FFs (flip-flops), CLBs (configurable logic blocks), I/Os (input/output blocks), and gates. The first partition is smaller in gate count than the second partition.

DATA PATH MAPPING RESULTS
MODULE | FFs | CLBs | I/Os | GATES
UTOPIA RX INTERFACE | 120 | 251 | 37 | 16,300
UTOPIA TX INTERFACE | 140 | 265 | 43 | 16,700
LOGIC QUEUE MANAGER | 247 | 332 | 31 | 5,380
ADDRESS LOOKUP | 87 | 96 | 82 | 1,700
ADDRESS CONVERTER | 14 | 13 | 17 | 240
PARALLELISM CONVERTER | 46 | 31 | 47 | 480
DATAPATH TOTAL | 658 | 1001 | 47 | 42,000

WHAT DO WE NEED FROM POLIS?
Î IMPROVED RTOS SCHEDULING POLICIES
AVAILABLE NOW:
– ROUND ROBIN
– STATIC PRIORITY
NEEDED:
– QUASI-STATIC SCHEDULING POLICY
Î BETTER MEMORY INTERFACE MECHANISMS
AVAILABLE NOW:
– EVENT BASED (RETURN TO THE RTOS ON EVENTS GENERATED BY MEMORY READ/WRITE OPERATIONS)
NEEDED:
– FUNCTION BASED (NO RETURN TO THE RTOS ON EVENTS GENERATED BY MEMORY READ/WRITE OPERATIONS)
181
182

This table shows the hardware features of the modules that are found in the data path of the ATM VPN server node architecture. The data path is the same for the two partitions because codesign is performed only on the control path modules.

So, does POLIS satisfy our requirements for a codesign/cosimulation environment that can reuse IP modules? There is a lack of static scheduling techniques and memory interface mechanisms currently in POLIS.

WHAT ELSE DO WE NEED FROM POLIS?
Î MOST WANTED: EVENT OPTIMIZATION
EVENT DEFINITION = INTER-MODULE COMMUNICATION PRIMITIVE
BUT:
– NOT ALL OF THE ABOVE PRIMITIVES ARE ACTUALLY NECESSARY
– UNNECESSARY INTER-MODULE COMMUNICATION LOWERS PERFORMANCE
Î SYNTHESIZABLE RTL OUTPUT
SYNTHESIZABLE OUTPUT FORMAT USED: XNF
PROBLEM: COMPLEX OPERATORS ARE TRANSLATED INTO EQUATIONS
– DIFFICULT TO OPTIMIZE
– CANNOT USE SPECIALIZED HW (ADDERS, COMPARATORS…)

ATM EXAMPLE CONCLUSIONS
HW/SW CODESIGN TOOLS PROVED HELPFUL IN REDUCING DESIGN TIME AND ERRORS
– CODESIGN TIME = 8 MAN MONTHS
– STANDARD DESIGN TIME = 3 MAN YEARS
POLIS REQUIRES IMPROVEMENTS TO FIT INDUSTRIAL TELECOM DESIGN NEEDS
– EVENT OPTIMIZATION + MEMORY ACCESS + SCHEDULING POLICY
EASY IP INTEGRATION IN THE POLIS DESIGN FLOW
FURTHER IMPROVEMENTS IN DESIGN TIME AND RELIABILITY

183 184

Also, there is a lack of event optimization to enhance performance. Further, synthesizable RTL output is also needed from POLIS.

The ATM case study was developed in 8 man-months instead of the 3 man-years that would have been required without a codesign/cosimulation environment such as POLIS. This case study showed how IP reuse needs integration techniques that may not be provided by a system-level design tool and must be hand-crafted.

Outline
z Introduction to a simple digital camera
z Designer’s perspective
z Requirements specification
z Design
– Four implementations

Digital Camera SoC Example
F. Vahid and T. Givargis, Embedded Systems Design: A Unified Hardware/Software Introduction, Chapter 7.
185
186

This is Chapter 7 of Frank Vahid’s book “Embedded Systems Design: A Unified Hardware/Software Introduction”. The digital camera SoC is used as an illustration example for the hardware-software codesign part of this course. It shows how different design choices can be made by trading off between hardware and software. It also shows how a digital camera SoC may be designed.

The example is a simple digital camera SoC, with image capture, storage, and transfer capabilities. We will discuss the system requirements from a user’s point of view and then give four implementation choices, starting from a pure software one. Since the performance requirement is a maximum time interval of one second between two presses of the camera shutter (image captures), some hardware is introduced gradually as we improve the performance. Software-based approximations such as fixed-point representation are also used to enhance performance in one of the design architectures.

Introduction to a simple digital camera
z Captures images
z Stores images in digital format
– No film
– Multiple images stored in camera
• Number depends on amount of memory and bits used per image
z Downloads images to PC
z Only recently possible
– Systems-on-a-chip
• Multiple processors and memories on one IC
– High-capacity flash memory
z Very simple description used for example
– Many more features with real digital camera
• Variable size images, image deletion, digital stretching, zooming in and out, etc.

Designer’s perspective
z Two key tasks
– Processing images and storing in memory
• When shutter pressed:
– Image captured
– Converted to digital form by charge-coupled device (CCD)
– Compressed and archived in internal memory
– Uploading images to PC
• Digital camera attached to PC
• Special software commands camera to transmit archived images serially


The main functions of the simple digital camera are as follows:
(a) captures images,
(b) stores images in digital format, and
(c) downloads images to a PC.
The full set of camera functions could be designed on a single chip only recently, due to the advances in fabrication technology resulting in the possibility of SoC designs.
The camera example is a simple one. More complex functions such as image processing also exist in real digital cameras and can be incorporated into this example.

The details of the two key tasks, namely image capture and download, are given here. A CCD device is used for digital image capture, and the image is archived in internal memory. The image must be downloadable to a PC through some serial communication interface.

Charge-coupled device (CCD)
z Special sensor that captures an image
z Light-sensitive silicon solid-state device composed of many cells
When exposed to light, each cell becomes electrically charged. This charge can then be converted to an 8-bit value where 0 represents no exposure while 255 represents very intense exposure of that cell to light.
Some of the columns are covered with a black strip of paint. The light-intensity of these pixels is used for zero-bias adjustments of all the cells.
The electromechanical shutter is activated to expose the cells to light for a brief moment.
The electronic circuitry, when commanded, discharges the cells, activates the electromechanical shutter, and then reads the 8-bit charge value of each cell. These values can be clocked out of the CCD by external logic through a standard parallel bus interface.
(Figure: lens area, electromechanical shutter, covered columns, pixel rows and columns, electronic circuitry.)

Zero-bias error
z Manufacturing errors cause cells to measure slightly above or below actual light intensity
z Error typically same across columns, but different across rows
z Some of left most columns blocked by black paint to detect zero-bias error
– Reading of other than 0 in blocked cells is zero-bias error
– Each row is corrected by subtracting the average error found in blocked cells for that row
(Example: 8 x 8 pixel tables before and after zero-bias adjustment; e.g. a row reading 136 170 155 … with covered-cell readings 12 and 14 has zero-bias 13 and is corrected to 123 157 142 ….)

189 190

The CCD is a light-sensitive silicon solid-state device composed of many cells that can be used to capture digital images. The details of the CCD are as described in the slide. Each cell is charged when exposed to light, and this charge is then converted into an 8-bit digital format. Due to manufacturing instability, there are offsets introduced for cells of different rows. This is compensated for by covering two or more columns of the CCD with a black strip of paint such that the cells of those columns are always hidden from light. Details on this “zero-biasing” procedure are given in later slides. There is electronic circuitry for reading the charges from each cell, column-wise, such that the image can be stored in memory.

The zero-bias error is caused by manufacturing instability that causes cells to register some charge even if they are not exposed to any light at all. The zero-bias error is the same for cells of the same row, but different for cells of different rows. Thus, the error can be corrected by covering some of the leftmost columns with black paint so that light is blocked from those cells, and the charge registered by the blacked-out cells is then used to compensate for the error. An example of an 8x8 image before and after zero-bias adjustment is shown in the slide. The zero-bias adjustment is calculated as an average of the charges in the blacked-out cells of a particular row. This adjustment is then deducted from the charges of the other cells.

Compression
z Store more images
z Transmit image to PC in less time
z JPEG (Joint Photographic Experts Group)
– Popular standard format for representing digital images in a compressed form
– Provides for a number of different modes of operation
– Mode used in this example provides high compression ratios using DCT (discrete cosine transform)
– Image data divided into blocks of 8 x 8 pixels
– 3 steps performed on each block
• DCT
• Quantization
• Huffman encoding

DCT step
z Transforms original 8 x 8 block into a cosine-frequency domain
– Upper-left corner values represent more of the essence of the image
– Lower-right corner values represent finer details
• Can reduce precision of these values and retain reasonable image quality
z FDCT (Forward DCT) formula
– C(h) = if (h == 0) then 1/sqrt(2) else 1.0
• Auxiliary function used in main function F(u,v)
– F(u,v) = ¼ x C(u) x C(v) Σx=0..7 Σy=0..7 Dxy x cos(π(2x + 1)u/16) x cos(π(2y + 1)v/16)
• Gives encoded pixel at row u, column v
• Dxy is original pixel value at row x, column y
z IDCT (Inverse DCT)
– Reverses process to obtain original block (not needed for this design)

191 192

Images are captured and compressed before archiving them. The standard format is JPEG, which is also very popularly used. There are several operation modes in JPEG, of which the mode used in this example provides high compression ratios using DCT. The image is processed block by block, where each block is 8 x 8 pixels large. There are three steps performed on each block: DCT, quantization, and Huffman encoding.
It must be noted here that for a color image, some kind of color filter such as Bayer’s Color Filter Array (CFA) must be used before the 8 x 8 pixel block is compressed, and the image color must also be reconstructed from the 8 x 8 pixels.

This is the Discrete Cosine Transform (DCT) step in JPEG. The main purpose of DCT is to represent the most significant features of the image in the upper-left corner and leave the finer details in the lower-right corner. The values near the lower-right corner can thus be ignored to save storage space.
FDCT and IDCT are inverse functions of each other. Only the FDCT is needed in this design, since the camera only encodes; the IDCT reverses the process and is needed only by decoders. The function F(u, v) gives the newly encoded pixel at (u, v) calculated from the original pixel values Dxy at (x, y).

Quantization step
z Achieve high compression ratio by reducing image quality
– Reduce bit precision of encoded data
• Fewer bits needed for encoding
• One way is to divide all values by a factor of 2
– Simple right shifts can do this
– Dequantization would reverse process for decompression
(Example: divide each cell’s value by 8. The block after DCT starts 1150 39 -43 -10 26 -83 11 41 …; after quantization it starts 144 5 -5 -1 3 -10 1 5 ….)

Huffman encoding step
z Serialize 8 x 8 block of pixels
– Values are converted into single list using zigzag pattern
z Perform Huffman encoding
– More frequently occurring pixels assigned short binary code
– Longer binary codes left for less frequently occurring pixels
z Each pixel in serial list converted to Huffman encoded values
– Much shorter list, thus compression
193 194

This is the quantization step that is performed after DCT in JPEG. The main purpose of quantization is to reduce the amount of memory space required to store the image. Larger values require more space, hence the values are reduced (quantized) by some factor. In the example, each value in the 8 x 8 DCT-encoded block is divided by 8 to obtain the smaller-valued pixel array on the right-hand side. Fewer bits are thus required to store the right-hand-side quantized array than the original left-hand-side one.

This is the Huffman encoding step, which is performed after quantization in JPEG. An 8 x 8 pixel array captured by the electronic circuitry in a CCD is first transformed by DCT, then quantized, and finally archived using the Huffman encoding method. Huffman encoding is a lossless encoding method, which adopts a zigzag pattern of storing the 8 x 8 pixel array, starting from the upper-left corner to the lower-right corner (see image in slide). The details of how Huffman encoding is performed are in the next slide.
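A sketch of the divide-by-8 quantization. Rounding to nearest reproduces the slide's table (1150 → 144, −83 → −10); the "simple right shift" mentioned on the slide is the cheaper truncating/flooring hardware variant:

```c
#include <assert.h>

/* Quantize one DCT coefficient: divide by 8, rounding to nearest.
 * The rounding mode is inferred from the slide's example values;
 * (v >> 3) would be the cheap shift-based alternative. */
int quantize8(int v)
{
    return v >= 0 ? (v + 4) / 8 : -((-v + 4) / 8);
}
```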

Huffman encoding example
z Pixel frequencies on left
– Pixel value –1 occurs 15 times
– Pixel value 14 occurs 1 time
z Build Huffman tree from bottom up
– Create one leaf node for each pixel value and assign frequency as node’s value
– Create an internal node by joining any two nodes whose sum is a minimal value
• This sum is internal node’s value
– Repeat until complete binary tree
z Traverse tree from root to leaf to obtain binary code for leaf’s pixel value
– Append 0 for left traversal, 1 for right traversal
z Huffman encoding is reversible
– No code is a prefix of another code
(Example pixel frequencies and resulting Huffman codes: –1: 15x → 00; 0: 8x → 100; –2: 6x → 110; 1: 5x → 010; 2: 5x → 1110; 3: 5x → 1010; 5: 5x → 0110; –3: 4x → 11110; –5: 3x → 10110; –10: 2x → 01110; 144: 1x → 111111; –9: 1x → 111110; –8: 1x → 101111; –4: 1x → 101110; 6: 1x → 011111; 14: 1x → 011110. The Huffman tree’s root value is 64, the total frequency.)

Archive step
z Record starting address and image size
– Can use linked list
z One possible way to archive images
– If max number of images archived is N:
• Set aside memory for N addresses and N image-size variables
• Keep a counter for location of next available address
• Initialize addresses and image-size variables to 0
• Set global memory address to N x 4
– Assuming addresses, image-size variables occupy N x 4 bytes
• First image archived starting at address N x 4
• Global memory address updated to N x 4 + (compressed image size)
z Memory requirement based on N, image size, and average compression ratio
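The archive bookkeeping on the Archive step slide can be sketched as follows. N, the flat directory layout, and the lack of full-memory/wrap-around handling are simplifications; the slide's N x 4 starting address is kept:

```c
#include <assert.h>

#define N 32  /* max number of archived images (illustrative) */

/* Flat archive directory at the start of memory: N start addresses
 * (and image-size variables) followed by the images themselves. */
static unsigned addr[N];          /* start address of each image (0 = free) */
static unsigned size[N];          /* compressed size of each image */
static int      next = 0;         /* counter: next free directory slot */
static unsigned mem_ptr = N * 4;  /* global memory address, past the table */

/* Returns the start address assigned to the new image, or 0 if full. */
unsigned archive_image(unsigned compressed_size)
{
    if (next >= N) return 0;
    addr[next] = mem_ptr;
    size[next] = compressed_size;
    next++;
    mem_ptr += compressed_size;   /* advance past the stored image */
    return addr[next - 1];
}
```

The first image lands at N x 4 and each later image starts where the previous one ended, so variable image sizes are accommodated automatically.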

195 196

The Huffman encoding uses the shortest string to represent the most frequently used value and the longest string for the least frequently used. In the example above, -1 occurs the most frequently (15 times) and is represented by 00. A Huffman tree is constructed bottom up. A leaf node is created for each pixel value. Two leaf nodes are joined to an internal node if their sum is currently the minimal value. The total value is thus 64 in the above Huffman tree example. The string for a leaf node (a value) is obtained by traversing the tree top-down. A 0 is appended for each left branch and a 1 for each right branch. Huffman encoding is lossless and reversible because no code is a prefix of another code.

This step is performed after an image is captured and compressed. Normally, some kind of filesystem such as FAT16 or FAT32 is used on Flash or other memory media for storing compressed images. Here, a simpler archiving method is used. Memory is allocated for N addresses and N image-size variables. A counter points to the next available address. 4-byte addresses are assumed, and hence the global memory address table requires N x 4 bytes. The first image is stored at address N x 4. Image size variations can be accommodated in this archiving method.

195 196
Uploading to PC
z When connected to PC and upload command received
– Read images from memory
– Transmit serially using UART
– While transmitting
• Reset pointers, image-size variables and global memory pointer
accordingly

Requirements Specification
z System's requirements – what system should do
– Nonfunctional requirements
• Constraints on design metrics (e.g., "should use 0.001 watt or less")
– Functional requirements
• System's behavior (e.g., "output X should be input Y times 2")
– Initial specification may be very general and come from marketing dept.
• E.g., short document detailing market need for a low-end digital camera that:
– captures and stores at least 50 low-res images and uploads to PC,
– costs around $100 with single medium-size IC costing less than $25,
– has as long a battery life as possible,
– has expected sales volume of 200,000 if market entry < 6 months,
– 100,000 if between 6 and 12 months,
– insignificant sales beyond 12 months

197 198

The uploading of compressed and stored images to a PC is accomplished in this
step. Images are read from memory and transmitted serially using the UART.
Transmitted images are automatically deleted from the archive by resetting the
pointers.

The requirements for a system can be segregated into functional and
non-functional ones. Non-functional requirements include performance,
reliability, security, safety, availability, and standards conformance.
Functional requirements include all the functions that a system should
perform, mainly its behavior.

Some examples of system specification for a digital camera SoC that come from
the marketing department are shown in the slide. The market time window is
very important: the later the market entry, the less valuable the product
becomes.

Nonfunctional requirements
z Design metrics of importance based on initial specification
– Performance: time required to process image
– Size: number of elementary logic gates (2-input NAND gate) in IC
– Power: measure of avg. electrical energy consumed while processing
– Energy: battery lifetime (power x time)
z Constrained metrics
– Values must be below (sometimes above) certain threshold
z Optimization metrics
– Improved as much as possible to improve product
z Metric can be both constrained and optimization

Nonfunctional requirements (cont.)
z Performance
– Must process image fast enough to be useful
– 1 sec reasonable constraint
• Slower would be annoying
• Faster not necessary for low-end of market
– Therefore, constrained metric
z Size
– Must use IC that fits in reasonably sized camera
– Constrained and optimization metric
• Constraint may be 200,000 gates, but smaller would be cheaper
z Power
– Must operate below certain temperature (cooling fan not possible)
– Therefore, constrained metric
z Energy
– Reducing power or time reduces energy
– Optimized metric: want battery to last as long as possible

199 200

Non-functional requirements for a digital camera also include the power or
energy consumption, because a high power consumption means a shorter battery
life and thus a poorer-quality product to sell. Details are in the next slide.
Metrics can be of two types: constrained and optimization. Constrained metrics
specify threshold (minimum or maximum) values: for example, a max cost, a min
performance, a max memory size, etc. Optimization metrics specify the goals a
designer must pursue while designing the system. For example, we might need to
minimize the cost under some performance constraints, or maximize the
performance under some cost constraints.

Non-functional requirements are detailed in this slide for a digital camera
SoC. The performance requirement is a one-second constraint on the time
interval between two shutter presses. This is a constrained metric because a
longer delay would be annoying (you would be frustrated if you missed some
good football action while the camera was still processing the previous
image!). A shorter delay is not required for the low-end market because it
would mean an unaffordable price for low-end users. The other nonfunctional
requirements include constrained hardware size in gate count (200,000 gates),
power (must operate without a fan), and energy (must be optimized).

Informal functional specification
z Flowchart breaks functionality down into simpler functions
z Each function's details could then be described in English
– Done earlier in chapter
z Low quality image has resolution of 64 x 64
z Mapping functions to a particular processor type not done at this stage
(Flowchart: CCD input → Zero-bias adjust → DCT → Quantize → loop until
done → Archive in memory → loop over all 8×8 blocks → Transmit serially,
e.g., serial output 011010...)

Refined functional specification
z Refine informal specification into one that can actually be executed
z Can use C/C++ code to describe each function
– Called system-level model, prototype, or simply model
– Also is first implementation
z Can provide insight into operations of system
– Profiling can find computationally intensive functions
z Can obtain sample output used to verify correctness of final
implementation
(Figure: executable model of digital camera — an image file feeds CCD.C,
then CCDPP.C, CNTRL.C, CODEC.C, and UART.C, producing an output file)

201 202

The flowchart is a popular model for describing the functionalities of a
system, as in the example given for the digital camera in the slide. There is
a loop for processing all the 8 x 8 pixel blocks, a loop for transmitting all
the bits serially through the UART, and a reactive loop for processing the
next image from the CCD input.

The detailed functional specification is given in this slide in the form of an
executable model in terms of the C program files: ccd.c, ccdpp.c, cntrl.c,
codec.c, and uart.c. Profiling of the executables can provide information on
the existence of bottlenecks in the system. Each of the executable modules is
described in the later slides.
CCD module
z Simulates real CCD
z CcdInitialize is passed name of image file
z CcdCapture reads "image" from file
z CcdPopPixel outputs pixels one at a time

#include <stdio.h>
#define SZ_ROW 64
#define SZ_COL (64 + 2)
static FILE *imageFileHandle;
static char buffer[SZ_ROW][SZ_COL];
static unsigned rowIndex, colIndex;
void CcdInitialize(const char *imageFileName) {
   imageFileHandle = fopen(imageFileName, "r");
   rowIndex = -1;
   colIndex = -1;
}
void CcdCapture(void) {
   int pixel;
   rewind(imageFileHandle);
   for(rowIndex=0; rowIndex<SZ_ROW; rowIndex++) {
      for(colIndex=0; colIndex<SZ_COL; colIndex++) {
         if( fscanf(imageFileHandle, "%i", &pixel) == 1 ) {
            buffer[rowIndex][colIndex] = (char)pixel;
         }
      }
   }
   rowIndex = 0;
   colIndex = 0;
}
char CcdPopPixel(void) {
   char pixel;
   pixel = buffer[rowIndex][colIndex];
   if( ++colIndex == SZ_COL ) {
      colIndex = 0;
      if( ++rowIndex == SZ_ROW ) {
         colIndex = -1;
         rowIndex = -1;
      }
   }
   return pixel;
}

CCDPP (CCD PreProcessing) module
z Performs zero-bias adjustment
z CcdppCapture uses CcdCapture and CcdPopPixel to obtain image
z Performs zero-bias adjustment after each row read in

#define SZ_ROW 64
#define SZ_COL 64
static char buffer[SZ_ROW][SZ_COL];
static unsigned rowIndex, colIndex;
void CcdppInitialize() {
   rowIndex = -1;
   colIndex = -1;
}
void CcdppCapture(void) {
   char bias;
   CcdCapture();
   for(rowIndex=0; rowIndex<SZ_ROW; rowIndex++) {
      for(colIndex=0; colIndex<SZ_COL; colIndex++) {
         buffer[rowIndex][colIndex] = CcdPopPixel();
      }
      bias = (CcdPopPixel() + CcdPopPixel()) / 2;
      for(colIndex=0; colIndex<SZ_COL; colIndex++) {
         buffer[rowIndex][colIndex] -= bias;
      }
   }
   rowIndex = 0;
   colIndex = 0;
}
char CcdppPopPixel(void) {
   char pixel;
   pixel = buffer[rowIndex][colIndex];
   if( ++colIndex == SZ_COL ) {
      colIndex = 0;
      if( ++rowIndex == SZ_ROW ) {
         colIndex = -1;
         rowIndex = -1;
      }
   }
   return pixel;
}

203 204

The CCD module simulates a real CCD by generating one pixel at a time. The
image input, however, is pre-stored in a file. The format used for each module
is similar: an initialization function, a capture function, and a pop
function. The image is assumed to be of size 64 x 64 pixels. There are two
extra columns (66 in total) which are used for zero-bias adjustment.

The CCDPP module performs zero-bias adjustment. It has the same format as the
CCD module. The adjustment is done in the CcdppCapture function.

UART module
z Actually a half UART
– Only transmits, does not receive
z UartInitialize is passed name of file to output to
z UartSend transmits (writes to output file) bytes at a time

#include <stdio.h>
static FILE *outputFileHandle;
void UartInitialize(const char *outputFileName) {
   outputFileHandle = fopen(outputFileName, "w");
}
void UartSend(char d) {
   fprintf(outputFileHandle, "%i\n", (int)d);
}

CODEC module
z Models FDCT encoding
z ibuffer holds original 8 x 8 block
z obuffer holds encoded 8 x 8 block
z CodecPushPixel called 64 times to fill ibuffer with original block
z CodecDoFdct called once to transform 8 x 8 block
– Explained in next slide
z CodecPopPixel called 64 times to retrieve encoded block from obuffer

static short ibuffer[8][8], obuffer[8][8], idx;
void CodecInitialize(void) { idx = 0; }
void CodecPushPixel(short p) {
   if( idx == 64 ) idx = 0;
   ibuffer[idx / 8][idx % 8] = p; idx++;
}
void CodecDoFdct(void) {
   int x, y;
   for(x=0; x<8; x++) {
      for(y=0; y<8; y++)
         obuffer[x][y] = FDCT(x, y, ibuffer);
   }
   idx = 0;
}
short CodecPopPixel(void) {
   short p;
   if( idx == 64 ) idx = 0;
   p = obuffer[idx / 8][idx % 8]; idx++;
   return p;
}
205 206

The UART module simulates the transmitter half of a real UART. This module
outputs to a file.

The CODEC module is responsible for FDCT encoding. CodecPushPixel and
CodecPopPixel are each called 64 times per 8 x 8 block, and a 64 x 64 image
contains 64 such blocks.

CODEC (cont.)
z Implementing FDCT formula
C(h) = if (h == 0) then 1/sqrt(2) else 1.0
F(u,v) = ¼ x C(u) x C(v) Σx=0..7 Σy=0..7 Dxy x
cos(π(2x + 1)u/16) x cos(π(2y + 1)v/16)
z Only 64 possible inputs to COS, so table can be used to save
performance time
– Floating-point values multiplied by 32,768 and rounded to nearest integer
– 32,768 chosen in order to store each value in 2 bytes of memory
– Fixed-point representation explained more later
z FDCT unrolls inner loop of summation, implements outer summation as
two consecutive for loops

static const short COS_TABLE[8][8] = {
   { 32768, 32138, 30273, 27245, 23170, 18204, 12539, 6392 },
   { 32768, 27245, 12539, -6392, -23170, -32138, -30273, -18204 },
   { 32768, 18204, -12539, -32138, -23170, 6392, 30273, 27245 },
   { 32768, 6392, -30273, -18204, 23170, 27245, -12539, -32138 },
   { 32768, -6392, -30273, 18204, 23170, -27245, -12539, 32138 },
   { 32768, -18204, -12539, 32138, -23170, -6392, 30273, -27245 },
   { 32768, -27245, 12539, 6392, -23170, 32138, -30273, 18204 },
   { 32768, -32138, 30273, -27245, 23170, -18204, 12539, -6392 }
};
static short ONE_OVER_SQRT_TWO = 23170;
static double COS(int xy, int uv) {
   return COS_TABLE[xy][uv] / 32768.0;
}
static double C(int h) {
   return h ? 1.0 : ONE_OVER_SQRT_TWO / 32768.0;
}
static int FDCT(int u, int v, short img[8][8]) {
   double s[8], r = 0; int x;
   for(x=0; x<8; x++) {
      s[x] = img[x][0] * COS(0, v) + img[x][1] * COS(1, v) +
             img[x][2] * COS(2, v) + img[x][3] * COS(3, v) +
             img[x][4] * COS(4, v) + img[x][5] * COS(5, v) +
             img[x][6] * COS(6, v) + img[x][7] * COS(7, v);
   }
   for(x=0; x<8; x++) r += s[x] * COS(x, u);
   return (short)(r * .25 * C(u) * C(v));
}

CNTRL (controller) module
z Heart of the system
z CntrlInitialize for consistency with other modules only
z CntrlCaptureImage uses CCDPP module to input image and place in buffer
z CntrlCompressImage breaks the 64 x 64 buffer into 8 x 8 blocks and
performs FDCT on each block using the CODEC module
– Also performs quantization on each block
z CntrlSendImage transmits encoded image serially using UART module

#define SZ_ROW 64
#define SZ_COL 64
#define NUM_ROW_BLOCKS (SZ_ROW / 8)
#define NUM_COL_BLOCKS (SZ_COL / 8)
static short buffer[SZ_ROW][SZ_COL], i, j, k, l, temp;
void CntrlInitialize(void) {}
void CntrlCaptureImage(void) {
   CcdppCapture();
   for(i=0; i<SZ_ROW; i++)
      for(j=0; j<SZ_COL; j++)
         buffer[i][j] = CcdppPopPixel();
}
void CntrlCompressImage(void) {
   for(i=0; i<NUM_ROW_BLOCKS; i++)
      for(j=0; j<NUM_COL_BLOCKS; j++) {
         for(k=0; k<8; k++)
            for(l=0; l<8; l++)
               CodecPushPixel(
                  (char)buffer[i * 8 + k][j * 8 + l]);
         CodecDoFdct(); /* part 1 - FDCT */
         for(k=0; k<8; k++)
            for(l=0; l<8; l++) {
               buffer[i * 8 + k][j * 8 + l] = CodecPopPixel();
               /* part 2 - quantization */
               buffer[i*8+k][j*8+l] >>= 6;
            }
      }
}
void CntrlSendImage(void) {
   for(i=0; i<SZ_ROW; i++)
      for(j=0; j<SZ_COL; j++) {
         temp = buffer[i][j];
         UartSend(((char*)&temp)[0]); /* send upper byte */
         UartSend(((char*)&temp)[1]); /* send lower byte */
      }
}

207 208

There are only 64 different input values for COS, hence we can precompute and
store all the COS values in a lookup table. The FDCT function implements the
transform equation using this table.

The CNTRL (controller) module is the heart of the system. It initializes all
the modules, uses the CCDPP module to input an image and place it in a buffer,
breaks the 64 x 64 buffer into 8 x 8 blocks, and performs the FDCT on each
block. Finally, it transmits the image serially using the UART module.

Putting it all together
z Main initializes all modules, then uses CNTRL module to capture,
compress, and transmit one image
z This system-level model can be used for extensive experimentation
– Bugs much easier to correct here rather than in later models

int main(int argc, char *argv[]) {
   char *uartOutputFileName = argc > 1 ? argv[1] : "uart_out.txt";
   char *imageFileName = argc > 2 ? argv[2] : "image.txt";
   /* initialize the modules */
   UartInitialize(uartOutputFileName);
   CcdInitialize(imageFileName);
   CcdppInitialize();
   CodecInitialize();
   CntrlInitialize();
   /* simulate functionality */
   CntrlCaptureImage();
   CntrlCompressImage();
   CntrlSendImage();
}

Design
z Determine system's architecture
– Processors
• Any combination of single-purpose (custom or standard) or general-purpose processors
– Memories, buses
z Map functionality to that architecture
– Multiple functions on one processor
– One function on one or more processors
z Implementation
– A particular architecture and mapping
– Solution space is set of all implementations
z Starting point
– Low-end general-purpose processor connected to flash memory
• All functionality mapped to software running on processor
• Usually satisfies power, size, and time-to-market constraints
• If timing constraint not satisfied then later implementations could:
– use single-purpose processors for time-critical functions
– rewrite functional specification

209 210

This is the main C function, which uses the CNTRL module.

An architecture is decided for the system by selecting the processors,
memories, and buses. The functionality is then mapped to the architecture and
finally implemented. A purely software implementation is executed first to
evaluate the system's performance. Usually the power, size, and time-to-market
constraints are satisfied, but the timing constraint may not be. A
single-purpose processor can then be used to accelerate the parts of the
system that require heavy computation.

Implementation 1: Microcontroller alone
z Low-end processor could be Intel 8051 microcontroller
z Total IC cost including NRE about $5
z Well below 200 mW power
z Time-to-market about 3 months
z However, one image per second not possible
– 12 MHz, 12 cycles per instruction
• Executes one million instructions per second
– CcdppCapture has nested loops resulting in 4096 (64 x 64) iterations
• ~100 assembly instructions each iteration
• 409,600 (4096 x 100) instructions per image
• Half of budget for reading image alone
– Would be over budget after adding compute-intensive DCT and Huffman
encoding

Implementation 2: Microcontroller and CCDPP
(Architecture: SOC containing 8051, UART, and CCDPP, with EEPROM and RAM)
z CCDPP function implemented on custom single-purpose processor
– Improves performance – less microcontroller cycles
– Increases NRE cost and time-to-market
– Easy to implement
• Simple datapath
• Few states in controller
z Simple UART easy to implement as single-purpose processor also
z EEPROM for program memory and RAM for data memory added as well

211 212

This is the first implementation, using a single microcontroller such as the
Intel 8051. The total IC cost including NRE is about US$5 and the power is
well below the 200 mW power constraint. The time-to-market is about 3 months.
However, this implementation cannot meet the performance requirement of 1
second per image capture. The 8051 runs at 12 MHz with 12 cycles per
instruction, so it can execute one million instructions per second.
CcdppCapture alone requires nearly half of this budget, so the implementation
cannot meet the performance constraint once the even more compute-intensive
DCT and Huffman encoding are added.

To accelerate the software implementation (previous slide), the CCDPP function
is implemented on a custom single-purpose processor. The UART is also
implemented as a single-purpose processor. EEPROM and RAM are added, too.

Microcontroller
z Synthesizable version of Intel 8051 available
– Written in VHDL
– Captured at register transfer level (RTL)
z Fetches instruction from ROM
z Decodes using Instruction Decoder
z ALU executes arithmetic operations
– Source and destination registers reside in RAM
z Special data movement instructions used to load and store externally
z Special program generates VHDL description of ROM from output of C
compiler/linker
(Block diagram of Intel 8051 processor core: Instruction Decoder,
Controller, and ALU, with 4K ROM and 128 RAM, connected to the external
memory bus)

UART
z UART in idle mode until invoked
– UART invoked when 8051 executes store instruction with UART's enable
register as target address
• Memory-mapped communication between 8051 and all single-purpose processors
• Lower 8-bits of memory address for RAM
• Upper 8-bits of memory address for memory-mapped I/O devices
z Start state transmits 0 indicating start of byte transmission then
transitions to Data state
z Data state sends 8 bits serially then transitions to Stop state
z Stop state transmits 1 indicating transmission done then transitions
back to idle mode
(FSMD description of UART: Idle → Start [transmit LOW, I=0] → Data
[transmit data(I), then I++, while I<8] → Stop [transmit HIGH when I=8]
→ Idle)
213 214

The microcontroller is an Intel 8051 VHDL model at the RTL. It fetches an
instruction from ROM, decodes it, and executes it using the ALU. Special data
movement instructions are used to load and store externally. There is a
special program (given in the downloadable package) that compiles the
executable derived from C compiling/linking into ROM code for fetching and
execution by the 8051 microcontroller.

The UART is described by an FSM in the slide: it starts in the Idle mode and,
when invoked, transmits the data bit-by-bit. When a byte has been transmitted,
it stops and returns to the Idle mode. The UART transmitter hardware is
implemented from this FSM.

CCDPP
z Hardware implementation of zero-bias operations
z Interacts with external CCD chip
– CCD chip resides external to our SOC mainly because combining CCD with
ordinary logic not feasible
z Internal buffer, B, memory-mapped to 8051
z Variables R, C are buffer's row, column indices
z GetRow state reads in one row from CCD to B
– 66 bytes: 64 pixels + 2 blacked-out pixels
z ComputeBias state computes bias for that row and stores in variable Bias
z FixBias state iterates over same row subtracting Bias from each element
z NextRow transitions to GetRow for repeat of process on next row or to
Idle state when all 64 rows completed
(FSMD description of CCDPP: Idle → GetRow [B[R][C]=Pxl, C=C+1, while
C<66] → ComputeBias [Bias = average of the two blacked-out pixels, C=0]
→ FixBias [B[R][C]=B[R][C]-Bias, while C<64] → NextRow [R++, C=0], back
to GetRow while R<64, else Idle)

Connecting SOC components
z Memory-mapped
– All single-purpose processors and RAM are connected to 8051's memory bus
z Read
– Processor places address on 16-bit address bus
– Asserts read control signal for 1 cycle
– Reads data from 8-bit data bus 1 cycle later
– Device (RAM or SPP) detects asserted read control signal
– Checks address
– Places and holds requested data on data bus for 1 cycle
z Write
– Processor places address and data on address and data bus
– Asserts write control signal for 1 clock cycle
– Device (RAM or SPP) detects asserted write control signal
– Checks address bus
– Reads and stores data from data bus

215 216

The CCDPP FSM is also illustrated in the slide. It starts in the Idle mode;
when invoked, it gets a row of pixel values and computes the corresponding
bias, which is then used to fix all the pixels of that row; then another row
is obtained and the process repeats. B is the internal buffer used to store
the row pixels, 66 bytes in total.

All the SoC components described in the previous slides are then
interconnected. The basic transactions after interconnecting the components
are read and write, which are very similar to microprocessor transactions.

Software
z System-level model provides majority of code
– Module hierarchy, procedure names, and main program unchanged
z Code for UART and CCDPP modules must be redesigned
– Simply replace with memory assignments
• xdata used to load/store variables over external memory bus
• _at_ specifies memory address to store these variables
• Byte sent to U_TX_REG by processor will invoke UART
• U_STAT_REG used by UART to indicate it is ready for next byte
– UART may be much slower than processor
– Similar modification for CCDPP code
z All other modules untouched

Original code from system-level model:
#include <stdio.h>
static FILE *outputFileHandle;
void UartInitialize(const char *outputFileName) {
   outputFileHandle = fopen(outputFileName, "w");
}
void UartSend(char d) {
   fprintf(outputFileHandle, "%i\n", (int)d);
}

Rewritten UART module:
static unsigned char xdata U_TX_REG _at_ 65535;
static unsigned char xdata U_STAT_REG _at_ 65534;
void UARTInitialize(void) {}
void UARTSend(unsigned char d) {
   while( U_STAT_REG == 1 ) {
      /* busy wait */
   }
   U_TX_REG = d;
}

Analysis
z Entire SOC tested on VHDL simulator
– Interprets VHDL descriptions and functionally simulates execution of system
• Recall program code translated to VHDL description of ROM
– Tests for correct functionality
– Measures clock cycles to process one image (performance)
z Gate-level description obtained through synthesis
– Synthesis tool like compiler for SPPs
– Simulate gate-level models to obtain data for power analysis
• Number of times gates switch from 1 to 0 or 0 to 1
– Count number of gates for chip area
(Obtaining design metrics of interest: the VHDL descriptions run on a
VHDL simulator to measure execution time; a synthesis tool produces
gates; a gate-level simulator plus a power equation gives power; summing
the gates gives chip area)

217 218

This slide shows the method for transforming a module implemented in software
into one that drives a hardware single-purpose processor. It is basically a
replacement of the main functions inside a software module with hardware
register settings or some other kind of hardware control (e.g., control-signal
setting through the GPIO ports of a microprocessor). The UART example shows
how UartInitialize is emptied, as it is no longer required, and how the
UartSend function is changed to set and check UART registers rather than
output to a file.

The entire SoC was tested on a VHDL simulator, where the software code in C
was compiled, linked, and translated into VHDL ROM code for logic simulation.
The simulation was cycle accurate and could measure the number of cycles
required for processing an image. Power is estimated only after the system is
synthesized to the gate level and re-simulated at the gate level. The size of
the system can be estimated in terms of the gate count.

Implementation 2: Microcontroller and CCDPP
z Analysis of implementation 2
– Total execution time for processing one image: 9.1 seconds
– Power consumption: 0.033 watt
– Energy consumption: 0.30 joule (9.1 s x 0.033 watt)
– Total chip area: 98,000 gates

Implementation 3: Microcontroller and CCDPP/Fixed-Point DCT
z 9.1 seconds still doesn't meet performance constraint of 1 second
z DCT operation prime candidate for improvement
– Execution of implementation 2 shows microprocessor spends most cycles here
– Could design custom hardware like we did for CCDPP
• More complex so more design effort
– Instead, will speed up DCT functionality by modifying behavior
219 220

This slide shows the estimation results for implementation 2. It requires 9.1
seconds in total for processing an image, 0.033 watt of power, 0.30 joule of
energy (9.1 s x 0.033 watt), and a total chip area of 98,000 gates. The gate
count and the power/energy consumption are within the specified constraints,
but the image-processing time (9.1 seconds) far exceeds the 1-second limit.

Since implementation 2 does not meet the performance constraint of 1 second
per image, profiling shows that the DCT, the most compute-intensive part
(more than 80% of total execution time), must be accelerated. We will first
look at a software improvement on the floating-point computations, which are
very expensive.

DCT floating-point cost
z Floating-point cost
– DCT uses ~260 floating-point operations per pixel transformation
– 4096 (64 x 64) pixels per image
– 1 million floating-point operations per image
– No floating-point support with Intel 8051
• Compiler must emulate
– Generates procedures for each floating-point operation (mult, add)
– Each procedure uses tens of integer operations
– Thus, > 10 million integer operations per image
– Procedures increase code size
z Fixed-point arithmetic can improve on this

Fixed-point arithmetic
z Integer used to represent a real number
– Constant number of integer's bits represents fractional portion of real
number
• More bits, more accurate the representation
– Remaining bits represent portion of real number before decimal point
z Translating a real constant to a fixed-point representation
– Multiply real value by 2 ^ (# of bits used for fractional part)
– Round to nearest integer
– E.g., represent 3.14 as 8-bit integer with 4 bits for fraction
• 2^4 = 16
• 3.14 x 16 = 50.24 ≈ 50 = 00110010
• 16 (2^4) possible values for fraction, each represents 0.0625 (1/16)
• Last 4 bits (0010) = 2
• 2 x 0.0625 = 0.125
• 3(0011) + 0.125 = 3.125 ≈ 3.14 (more bits for fraction would increase accuracy)
221 222

The DCT function requires around 260 floating-point operations per pixel
transformation, which amounts to about 1 million such operations per image
(64 x 64 pixels). There is no floating-point support in the 8051, so the
compiler must emulate each floating-point operation using integer operations,
tens of them per operation on average. Thus, more than 10 million integer
operations are needed in total per image. The procedures used to emulate
floating-point operations also increase the code size. The solution is
fixed-point arithmetic, as discussed next.

Fixed-point representation is used mainly to reduce the complexity of
floating-point arithmetic operations. An integer is used to represent a real
number: some of its bits represent the fractional part and the remaining bits
represent the integer portion (before the decimal point). A real value is
multiplied by 2 ^ (# of bits used for the fractional part) and then rounded to
the nearest integer. In the example, an integer has 8 bits, 4 of which
represent the fraction; the transformation of 3.14 into its fixed-point
representation 50 is shown in the slide. The inverse calculation starting from
50 yields 3.125 rather than 3.14, so there is an approximation.

Fixed-point arithmetic operations
z Addition
– Simply add integer representations
– E.g., 3.14 + 2.71 = 5.85
• 3.14 → 50 = 00110010
• 2.71 → 43 = 00101011
• 50 + 43 = 93 = 01011101
• 5(0101) + 13(1101) x 0.0625 = 5.8125 ≈ 5.85
z Multiply
– Multiply integer representations
– Shift result right by # of bits in fractional part
– E.g., 3.14 * 2.71 = 8.5094
• 50 * 43 = 2150 = 100001100110
• >> 4 = 10000110
• 8(1000) + 6(0110) x 0.0625 = 8.375 ≈ 8.5094
z Range of real values used limited by bit widths of possible resulting
values

Fixed-point implementation of CODEC
z COS_TABLE gives 8-bit fixed-point representation of cosine values
z 6 bits used for fractional portion
z Result of multiplications shifted right by 6

static const char code COS_TABLE[8][8] = {
   { 64, 62, 59, 53, 45, 35, 24, 12 },
   { 64, 53, 24, -12, -45, -62, -59, -35 },
   { 64, 35, -24, -62, -45, 12, 59, 53 },
   { 64, 12, -59, -35, 45, 53, -24, -62 },
   { 64, -12, -59, 35, 45, -53, -24, 62 },
   { 64, -35, -24, 62, -45, -12, 59, -53 },
   { 64, -53, 24, 12, -45, 62, -59, 35 },
   { 64, -62, 59, -53, 45, -35, 24, -12 }
};
static const char ONE_OVER_SQRT_TWO = 5;
static short xdata inBuffer[8][8], outBuffer[8][8], idx;
static unsigned char C(int h) { return h ? 64 : ONE_OVER_SQRT_TWO; }
static int F(int u, int v, short img[8][8]) {
   long s[8], r = 0;
   unsigned char x, j;
   for(x=0; x<8; x++) {
      s[x] = 0;
      for(j=0; j<8; j++)
         s[x] += (img[x][j] * COS_TABLE[j][v] ) >> 6;
   }
   for(x=0; x<8; x++) r += (s[x] * COS_TABLE[x][u]) >> 6;
   return (short)((((r * (((16*C(u)) >> 6) *C(v)) >> 6)) >> 6) >> 6);
}
void CodecInitialize(void) { idx = 0; }
void CodecPushPixel(short p) {
   if( idx == 64 ) idx = 0;
   inBuffer[idx / 8][idx % 8] = p << 6; idx++;
}
void CodecDoFdct(void) {
   unsigned short x, y;
   for(x=0; x<8; x++)
      for(y=0; y<8; y++)
         outBuffer[x][y] = F(x, y, inBuffer);
   idx = 0;
}
223 224

Addition and multiplication for real numbers represented in fixed-point format
can be performed directly on their fixed-point representations. Examples are
given in the slide.

The CODEC module is transformed into a fixed-point implementation. The
COS_TABLE gives an 8-bit fixed-point representation of the cosine values,
where 6 bits are used for the fractional portion.

Implementation 3: Microcontroller and CCDPP/Fixed-Point DCT
z Analysis of implementation 3
– Use same analysis techniques as implementation 2
– Total execution time for processing one image: 1.5 seconds
– Power consumption: 0.033 watt (same as 2)
– Energy consumption: 0.050 joule (1.5 s x 0.033 watt)
• Battery life 6x longer!!
– Total chip area: 90,000 gates
• 8,000 less gates (less memory needed for code)

Implementation 4: Microcontroller and CCDPP/DCT
(Architecture: SOC containing 8051, CODEC, UART, and CCDPP, with EEPROM
and RAM)
z Performance close but not good enough
z Must resort to implementing CODEC in hardware
– Single-purpose processor to perform DCT on 8 x 8 block

225 226

By converting the CODEC module to fixed point, we obtain implementation 3,
which is faster at the expense of some approximation in the image pixel
values. The execution time for processing a single image is 1.5 seconds, the
power is 0.033 watt, the energy is 0.050 joule, and the total chip area is
90,000 gates. Except for the execution time, all constraints are satisfied.

Performance in implementation 3 is close but not good enough, hence we finally
resort to implementing the CODEC in hardware using a single-purpose processor,
which performs the DCT on an 8 x 8 pixel block.

CODEC design
z 4 memory mapped registers
– C_DATAI_REG/C_DATAO_REG used to push/pop 8 x 8 block into and out of
CODEC
– C_CMND_REG used to command CODEC
• Writing 1 to this register invokes CODEC
– C_STAT_REG indicates CODEC done and ready for next block
• Polled in software
z Direct translation of C code to VHDL for actual hardware implementation
– Fixed-point version used
z CODEC module in software changed similar to UART/CCDPP in
implementation 2

Rewritten CODEC software:
static unsigned char xdata C_STAT_REG _at_ 65527;
static unsigned char xdata C_CMND_REG _at_ 65528;
static unsigned char xdata C_DATAI_REG _at_ 65529;
static unsigned char xdata C_DATAO_REG _at_ 65530;
void CodecInitialize(void) {}
void CodecPushPixel(short p) { C_DATAO_REG = (char)p; }
short CodecPopPixel(void) {
   return ((C_DATAI_REG << 8) | C_DATAI_REG);
}
void CodecDoFdct(void) {
   C_CMND_REG = 1;
   while( C_STAT_REG == 1 ) { /* busy wait */ }
}

Implementation 4: Microcontroller and CCDPP/DCT
z Analysis of implementation 4
– Total execution time for processing one image: 0.099 seconds (well
under 1 sec)
– Power consumption: 0.040 watt
• Increase over 2 and 3 because SOC has another processor
– Energy consumption: 0.00040 joule (0.099 s x 0.040 watt)
• Battery life 12x longer than previous implementation!!
– Total chip area: 128,000 gates
• Significant increase over previous implementations

227 228

The hardware implementation of the CODEC is shown here. 4 memory mapped
registers are used to record status, write commands, and hold data input and
data output. The fixed-point version of the C code is directly translated into
VHDL code. The software code for the CODEC is changed in a way similar to that
for the UART/CCDPP in implementation 2. Commands are written to the command
register in CodecDoFdct.

This hardware implementation of the DCT results in implementation 4, and the
execution time is reduced to 0.099 seconds, which is well under 1 second. Power
consumption is 0.040 watt, energy consumption is 0.0040 joule, and the total
chip area is 128,000 gates.

Summary of implementations

Metric                 Implementation 2   Implementation 3   Implementation 4
Performance (second)   9.1                1.5                0.099
Power (watt)           0.033              0.033              0.040
Size (gate)            98,000             90,000             128,000
Energy (joule)         0.30               0.050              0.0040

z Implementation 3
– Close in performance
– Cheaper
– Less time to build
z Implementation 4
– Great performance and energy consumption
– More expensive and may miss time-to-market window
• If DCT designed ourselves then increased NRE cost and time-to-market
• If existing DCT purchased then increased IC cost
z Which is better?

Summary

z Digital camera example
– Specifications in English and executable language
– Design metrics: performance, power and area
z Several implementations
– Microcontroller: too slow
– Microcontroller and coprocessor: better, but still too slow
– Fixed-point arithmetic: almost fast enough
– Additional coprocessor for compression: fast enough, but expensive and hard to design
– Tradeoffs between hw/sw!

229 230

A summary of the features of the different implementations is tabulated here.
Implementation 3 is close in performance, cheaper, and requires less time to
build. Implementation 4 has great performance and energy consumption; however,
it is more expensive than implementation 3. Which is better? Perhaps
implementation 3 is suitable for users who can accommodate a slight performance
degradation by paying less for the camera, whereas implementation 4 is for
users who are willing to pay for high performance. This is the very reason why
we have different product versions of the same brand with similar
functionalities.

As a summary, the digital camera example was specified in English and in an
executable language. There were 3 design metrics: performance, power, and area.
There are tradeoffs between hardware and software across the different design
alternatives for the digital camera.

