EE382N-4 Advanced Microcontroller Systems: Accelerators and Co-Processors

This document summarizes a class on accelerators and co-processors. It discusses the differences between accelerators and co-processors, with accelerators appearing as devices on a bus controlled by registers, while co-processors execute instructions dispatched by the CPU. Examples are given of tightly and loosely coupled co-processors. Common applications of hardware acceleration are also outlined, such as for graphics, audio/video processing, and encryption. Decision trees are presented for when and how to implement a hardware accelerator. Finally, different programming models for hybrid CPU-accelerator systems are described.

EE382N-4

Advanced Microcontroller Systems

Accelerators and Co-Processors

Mark McDermott

Spring 2018

EE382N-4 Class Notes

Agenda
§ Taxonomy of Hardware Acceleration
§ Co-Processors
– Tightly coupled
– Loosely coupled
• FP Matrix Multiplier
• MC68332 Time Processing Unit
§ ISA Enhancements:
– HC12 Fuzzy Logic Acceleration
§ Reconfigurable architectures
– Tensilica

2/22/18 EE382N-4 Class Notes 2

Taxonomy of Hardware Acceleration


Accelerator vs. co-processor
§ A co-processor executes instructions.
– Instructions are dispatched by the CPU.

§ An accelerator appears as a device on the bus.

– The accelerator is controlled by registers.


Hardware Acceleration
§ Ad hoc interface to controlling processor
– Usually memory-mapped
– Bus-based, FIFO, or register data interfaces

§ Typically, the processor transfers data to the accelerator, issues a go command, and then collects result data later.
– Polled or interrupt-based interface

§ Accelerator may have its own path to/from memory

§ Often fixed function but can be microcoded for programmability
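
The transfer / go / collect sequence above can be sketched in C for a polled, memory-mapped accelerator. This is only an illustration: the register layout `accel_regs_t`, the bit definitions, and the function `accel_run` are hypothetical and do not describe any particular device.

```c
#include <stdint.h>

/* Hypothetical register map; a real accelerator defines its own layout. */
typedef struct {
    volatile uint32_t data_in;   /* operand written by the CPU        */
    volatile uint32_t control;   /* bit 0 = GO (assumed)              */
    volatile uint32_t status;    /* bit 0 = DONE (assumed)            */
    volatile uint32_t data_out;  /* result read back by the CPU       */
} accel_regs_t;

#define ACCEL_CTRL_GO     (1u << 0)
#define ACCEL_STATUS_DONE (1u << 0)

/*
 * Transfer one operand, issue the go command, then poll for completion.
 * Returns 0 and stores the result on success, or -1 if DONE was not
 * observed within max_polls reads of the status register.
 */
int accel_run(accel_regs_t *regs, uint32_t operand,
              uint32_t max_polls, uint32_t *result)
{
    regs->data_in = operand;          /* 1. transfer data    */
    regs->control = ACCEL_CTRL_GO;    /* 2. issue go command */

    for (uint32_t i = 0; i < max_polls; i++) {   /* 3. poll status */
        if (regs->status & ACCEL_STATUS_DONE) {
            *result = regs->data_out; /* 4. collect result   */
            return 0;
        }
    }
    return -1;                        /* timed out */
}
```

On real hardware `regs` would point at the device's mapped address; for experiments it can point at an ordinary struct in RAM. An interrupt-based interface would replace the polling loop with a wait for the completion interrupt.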


Common HW Acceleration Applications
§ Graphics
§ Data Compression
§ Audio/Video Encoding/Decoding
§ Image sensing and processing
§ Data Encryption: RSA, DES, AES
§ Router frame queuing, port selection


Common HW Acceleration Applications
§ Embedded Systems
– FPGAs appearing in set-top boxes, routers, audio equipment, etc.
• Advantages
– Performance close to ASIC, sometimes at much lower cost
Ø Many other embedded systems still use ASIC due to high volume
> Cell phones, iPod, game consoles, etc.
– Reconfigurable!
Ø If standards change, architecture is not fixed
Ø Can add new features after production


Common HW Acceleration Applications
§ High-performance embedded computing (HPEC)
– High-performance/super computing with special needs (low power, low
size/weight, etc.)
• Satellite image processing
• Target recognition in a UAV
– Advantages
• Much smaller/lower power than a supercomputer
• Fault tolerance


Common HW Acceleration Applications
§ High-performance computing (HPC)
– Cray, SGI, DRC, GiDEL, Nallatech, XtremeData
• Combine high-performance microprocessors with FPGA accelerators
– Novo-G
• 192 Altera Stratix III FPGAs integrated with 24 quad-core microprocessors
§ Advantages
– HPC used for many scientific apps
• Low volume, ASIC rarely feasible, microprocessor too slow
– Lower power consumption
• Increasingly important
• Cooling and energy costs are dominant factor in total cost of ownership


Common HW Acceleration Applications
§ General-purpose computing
– Ideal situation: desktop machine/OS uses a programmable accelerator to
speed up all applications (similar to GPU trend)
– Problems
• The accelerator can be very fast, but not for all applications
– Generally requires parallel algorithms
• Coding constructs used in many applications not appropriate for hardware
– Subject of tremendous amount of past and likely future research
§ How to use extra transistors on general purpose CPUs?
– More cache
– More microprocessor cores
– GPU
– FPGA?
– Something else?


Decision Tree: When do you use a hardware accelerator?

Easy
§ Can an existing algorithm be implemented using existing ISA?
§ Can a new algorithm be devised to solve problem using existing ISA?
§ Can API be modified to expose necessary functionality or make it easier to exploit?
§ Can HW accelerator be added as a co-processor instruction?
§ Can ISA be modified to better support algorithm?
§ Can datapath be modified to better support algorithm, without breaking others?
Hard

Four Programmers Models of Accelerator Design

[Figure: four panels, each showing a CPU and an accelerator with the software stack above the CPU.]
§ Base: HW I/F only (application only)
§ No OS service, in simple embedded systems (application over an OS)
§ OS service: accelerator accessed as a user space memory mapped I/O device (application, mmap(), OS)
§ Virtualized device with OS scheduling support (application, dev() driver, OS)

Perkowski, psu.edu


Hybrid Hardware/Software Execution Model
§ Hardware Accelerator as a Kernel Module
– Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
– The KM approach enables transparent interchange of software and hardware components
§ Application level execution model
– Compiler deep analysis and transformations generate CPU code, hardware library stubs and synthesized components
– FPGA bitmaps as hardware counterpart to existing software modules.
– Same dynamic linking library interfaces and stubs apply to both software and hardware implementation
§ OS resource management
– Services (API) for allocation, partial reconfiguration, saving and restoring the status, and monitoring
– Multiprogramming scheduler can pre-fetch hardware accelerators in time for next use
– Control the access to the new hardware to ensure trust under private or shared use

[Figure: compile-time and run-time flow. Compile time: source code, compiler analysis/transformations, human designed hardware, synthesis. User runtime: application, DLL, linker/loader, resource manager. Kernel runtime: OS modules, Linux OS. Hardware: CPU, memory, devices, FPGA accelerators. A user level function or device driver is either a soft object or a hard object.]

Perkowski, psu.edu


CPU-Accelerator Interface
§ The accelerator exposes control registers to the CPU
§ Data registers can be used for small data objects
§ Accelerator may include special-purpose read/write logic (DMA
hardware)
– Especially valuable for large data transfers


CPU-Accelerator Interface Example

[Figure: block diagram with PS and PL regions. Blocks: ARM core, slave I/C, AXI, DDR controller, DDR, block RAM, accelerator. Paths numbered 1 to 6 correspond to the table rows below.]

§ AXI
– 32-bit bus
– Access to DRAM data & programmable logic fabric
– 1/2 CPU frequency
– Big penalty if bus is busy during first attempt to access bus
§ AHB (AMBA Advanced High-performance Bus)
– 64-bit bus
– Runs at CPU clock frequency
– Access to DDR Controller to provide addresses to SDRAM

Bus Access            First Access      Pipelined       Arbitration
                      Read   Write      Read   Write
1  ARM → I/C            2      2          2      2
2  I/C → AXI            8      8          3      3           5
3  AHB → DDRC           4      4          4      4
4  DDRC → DRAM          8      9          3      3           5
5  AXI ↔ BRAM          20     20          8      8          12
6  BRAM ↔ ACC           2      2          2      2

Perkowski, psu.edu


Hardware Accelerator Interface: Interrupts or Polling?
§ Polling interfaces usually require the processor to read a
memory-mapped register to determine the state of the
accelerator.
– Can the accelerator accept new input data?
– Is the accelerator done with its current task?
– Has the accelerator generated an error condition?

§ Polling interfaces offer minimal latency between the setting of a
condition on the accelerator and its
discovery by the controlling processor.
– But processor isn’t doing useful work while it polls…


Hardware Accelerator Interface: Interrupts or Polling?
§ Interrupt-based interfaces allow the accelerator to signal
conditions to the controlling processor.
– Interrupt latency is longer than is achievable via the polling method.
– But the processor can more easily proceed with other work while the
accelerator is busy with a task.

§ Interrupts are more efficient for coarse-grained parallelism (i.e.,
larger tasks with looser and less frequent synchronization
requirements)

§ Interrupts may not work for real-time control tasks with tight
schedules


Typical CPU → Accelerator Transaction

[Figure: timeline with three columns: Application, Operating System, Hardware. The Hardware column shows ARM, memory, AXI, the DMA controller, and the accelerator (executing) as each step proceeds.]

§ Application: open(/dev/accel); /* only once */
– Operating system: enable accelerator access for application
§ Application: /* construct macroblocks */ macroblock = …
§ Application: syscall(&macroblock, num_blocks)
– Operating system: data copy
– Operating system: flush cache range
– Operating system: setup DMA transfer
– Operating system: poll (accelerator executing)
– Operating system: setup DMA transfer
– Operating system: invalidate cache range
– Operating system: data copy
§ Application: /* macroblock now has transformed data */
Perkowski, psu.edu
Caching Issues with Accelerators
§ Main memory provides the primary data transfer mechanism to
the accelerator.
§ Programs must ensure that caching does not invalidate main
memory data.
– CPU reads location S.
– Accelerator writes location S.
– CPU writes location S.
• BAD – Program will not see proper value of S stored in the cache

§ The bus interface may provide mechanisms for accelerators to
tell the CPU of required cache changes…
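
The hazard can be reproduced with a toy software model of one cached location. This is a sketch under simplifying assumptions: a single location S, a cache with one valid bit, and an accelerator that writes main memory directly. All names (`toy_system_t`, `cpu_read_S`, and so on) are hypothetical. It shows the case where the CPU accesses S again after the accelerator's write and hits the old cached copy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one memory location S and the CPU's cached copy of it. */
typedef struct {
    uint32_t memory_S;     /* location S in main memory      */
    uint32_t cached_S;     /* CPU cache copy of S            */
    bool     cache_valid;  /* true if cached_S may be used   */
} toy_system_t;

/* CPU read: a hit returns the cached copy without consulting memory. */
uint32_t cpu_read_S(toy_system_t *sys)
{
    if (!sys->cache_valid) {              /* miss: fill from memory */
        sys->cached_S = sys->memory_S;
        sys->cache_valid = true;
    }
    return sys->cached_S;
}

/* Accelerator write: goes straight to main memory, bypassing the cache. */
void accel_write_S(toy_system_t *sys, uint32_t value)
{
    sys->memory_S = value;
}

/* Software (or bus-initiated) invalidation of the cached copy. */
void cpu_invalidate_S(toy_system_t *sys)
{
    sys->cache_valid = false;
}
```

In this model, a CPU read after `accel_write_S` still returns the old value until `cpu_invalidate_S` is called, which is the role the invalidate step plays in a real driver.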


Synchronization and Memory
§ As with cache, main memory writes to shared memory may cause
invalidation (memory incoherence).
– CPU reads location S
– Accelerator writes S
– CPU writes S

§ Many CPU buses implement test-and-set atomic operations that
the accelerator can use to implement a semaphore. This can
serve as a highly efficient means of synchronizing inter-process
communications (IPC).
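
A binary semaphore built on test-and-set can be sketched with the C11 `atomic_flag` type. This models two software agents sharing a flag in memory; whether a given bus lets an accelerator perform the same atomic operation depends on the hardware. The names `tas_lock_t`, `tas_try_acquire`, `tas_acquire`, and `tas_release` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Binary semaphore backed by a single test-and-set flag. */
typedef struct {
    atomic_flag locked;
} tas_lock_t;

void tas_lock_init(tas_lock_t *l)
{
    atomic_flag_clear(&l->locked);
}

/* One test-and-set attempt; returns true if the caller now owns the lock. */
bool tas_try_acquire(tas_lock_t *l)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Spin until the lock is obtained. */
void tas_acquire(tas_lock_t *l)
{
    while (!tas_try_acquire(l)) {
        /* busy-wait */
    }
}

void tas_release(tas_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The owner brackets its access to the shared buffer with `tas_acquire` and `tas_release`; the other side spins or retries later with `tas_try_acquire`.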


Software Versus Hardware Acceleration

Overhead is a major issue!

Perkowski, psu.edu
Device Driver Access Cost

Perkowski, psu.edu


Co-Processors


Tightly Coupled Coprocessors
§ Integrated with processor control logic
– Task typically completes in a few cycles
– Small amounts of data
– Processor stalls waiting for the coprocessor
– Communication with coprocessor typically via registers and dedicated
control signals
– Coprocessor ports
• Examples:
– ARM (ARM7TDMI);
– Texas Instruments TMS320C55x processors


Tightly-Coupled Coprocessor Example

[Figure: TMS320C55x block diagram. Labels: memory system, instruction decode, TCC I/F, TCC, TCC instructions.]

Loosely-Coupled Coprocessors
§ Loosely-Coupled Coprocessors
– Used for larger tasks than is the case for
tightly-coupled coprocessors
– Task runs in parallel with main processor
– May take many cycles per task
– Large amounts of data that coprocessor may access independent of main
processor
– May or may not use the standard coprocessor interface


Loosely-Coupled FP Matrix Multiplier Coprocessor

https://www.xilinx.com/support/documentation/application_notes/xapp1170-zynq-hls.pdf


Accelerator Coherency Port (ACP)
§ Accelerator coherency port (ACP) is a 64-bit AXI slave interface
on the SCU that provides an asynchronous cache-coherent access
point directly from the PL to the Cortex-A9 MP-Core processor
subsystem.
§ A range of system PL masters can use this interface to access the
caches and the memory subsystem exactly the way the APU
processors do to simplify software, increase overall system
performance, or improve power consumption.
– Interface acts as a standard AXI slave and supports all standard read and
write transactions without any additional coherency requirements placed on
the PL components. Therefore, the ACP provides cache-coherent access from
the PL to ARM caches, while any memory local to the PL is non-coherent
with the ARM.
http://www.xilinx.com/support/answer-navigation/answer-keyword-search.html?type=answerRecord&analytics=AnswersDatabase&searchKeywords=ZYNQ+ACP

https://www.xilinx.com/support/documentation/sw_manuals/ug1046-ultrafast-design-methodology-guide.pdf


ACP Usage
§ The ACP provides a low latency path between the PS and the
accelerators implemented in the PL when compared with a legacy
cache flushing and loading scheme. Steps that must take place in
an example of a PL-based accelerator are as follows:
– The CPU prepares input data for the accelerator within its local cache space.
– The CPU sends a message to the accelerator using one of the general
purpose AXI master interfaces to the PL.
– The accelerator fetches the data through the ACP, processes the data, and
returns the result through the ACP.
– The accelerator sets a flag by writing to a known location to indicate that the
data processing is complete. Status of this flag can be polled by the
processor


ACP Caveats
§ NOTE: When compared to a tightly-coupled coprocessor, ACP
access latencies are relatively long. Therefore, ACP is not
recommended for fine-grained instruction level acceleration.
§ For coarse-grain acceleration such as video frame-level
processing, ACP does not have a clear advantage over traditional
memory-mapped PL acceleration because the transaction
overhead is small relative to the transaction time, and using the ACP
might cause undesirable cache thrashing.
§ ACP is therefore optimal for medium-grain acceleration, such as
block-level crypto acceleration and video macro-block level
processing.


Micro-coded Co-Processor: MC68332 Time Processing Unit

MC68332 Time Processing Unit
§ The TPU3 can be viewed as a special-purpose microcomputer that performs a
programmable series of two operations, match and capture.
§ The microengine uses microcode to perform functions.

[Figure: TPU block diagram. Labels: host interface on the Inter-Module Bus (IMB), system configuration, channel control, development support and test, parameter RAM (data), scheduler, service requests, microengine (control store, execution unit), time base (CLK), timer channels (channel 0, channel 1, … channel 15), control and data, pins.]

Motorola, Inc. Kurt Keutzer UCB

Time Processing Unit
TPU Preprogrammed Functions:
§ Semi-autonomous microcontroller
§ Operates concurrently with CPU
• Schedules tasks
• Processes ROM instructions
• Accesses shared data with CPU
• Performs Input/Output operations
§ Programmable series of 2 operations
• Match
• Capture
§ Each operation is called an "event"
§ A pre-programmed series of events is
called a "function"

Motorola, Inc. Kurt Keutzer UCB

Time Bases

§ Two sixteen-bit counters provide

time bases for all
§ Pre-scalers controlled by CPU via
bit-fields in TPU module
configuration register TPUMCR
§ Current values accessible via TCR1
and TCR2 registers
§ TCR1, TCR2 can be read/written
by TPU microcode; not available
to CPU
§ TCR1 qualified by system clock
§ TCR2 qualified by system clock or
external clock
Motorola, Inc. Kurt Keutzer UCB
Timer Channels

§ Sixteen channels
– each one connected to an MCU pin
§ Each channel has symmetric
hardware:
§ Event register
– 16-bit capture register
– 16-bit compare/match register
– 16-bit comparator
§ Pin control logic - pin direction
determined by TPU microengine
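
The per-channel capture and match registers can be illustrated with a toy C model. This is not TPU microcode: the type `tpu_channel_t`, the function names, and the equality-based match test are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one timer channel (names are hypothetical). */
typedef struct {
    uint16_t capture;  /* 16-bit capture register       */
    uint16_t match;    /* 16-bit compare/match register */
} tpu_channel_t;

/* Capture: latch the current time-base value when a pin event occurs. */
void channel_capture(tpu_channel_t *ch, uint16_t tcr)
{
    ch->capture = tcr;
}

/* Match: assumed here to fire when the time base equals the match value. */
bool channel_match(const tpu_channel_t *ch, uint16_t tcr)
{
    return tcr == ch->match;
}

/* Ticks between two captures; 16-bit arithmetic absorbs counter wraparound. */
uint16_t elapsed_ticks(uint16_t earlier, uint16_t later)
{
    return (uint16_t)(later - earlier);
}
```

A period measurement, for example, would capture the time base on two successive pin edges and pass both values to `elapsed_ticks`.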

Motorola, Inc. Kurt Keutzer UCB

Scheduler

§ Determines which of sixteen
channels is serviced by the
microengine
§ Channel can request service
for one of four reasons
– host service
– link to another channel
– match event
– capture event
§ Host system assigns to each
channel a priority
– high
– middle
– low

Motorola, Inc. Kurt Keutzer UCB

Microengine

§ Executes microcoded
functions for selected
channel.
§ Returns control to
scheduler when
completed.

Motorola, Inc. Kurt Keutzer UCB

WCS – Writeable Control Store

§ The microcode in the TPU is
hard coded into a mask
programmable ROM.
§ To facilitate microcode
development and debug, a
block RAM can be used to
replace the ROM providing a
“Writeable Control Store”
capability.

68332 Die Photo


ISA Enhancements: Fuzzy Logic Acceleration


Overview
§ Fuzzy controllers provide a unique way of controlling complex
systems.
– If written in software they can take a long time to execute, limiting their
application in real-time systems.
– Using dedicated logic and a few new assembler instructions, a
microcontroller can be enhanced to execute a fuzzy controller quite
efficiently.

Chapman - ualberta.ca
Fuzzy Sets
§ In the real world, very few things belong to a single classification
and often the boundaries are not clear
§ Fuzzy sets, then, are an extension of regular (Aristotelian) sets:
– sets can overlap
– members of a set have a degree of membership instead of just belonging to
it or not
§ Fuzzy sets can be given linguistic labels such as warm, positive
big, very cold, etc.
§ Fuzzy logic allows the sets to be computed
– boundary conditions same as two-level logic

Chapman - ualberta.ca
Input Memberships
§ The degree to which an input belongs to a classification is called
its membership value
§ The classifications can overlap so that an input value can belong
to more than one, hence the ability to fuzzify the input

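[Figure: five overlapping membership functions labeled negative big, negative small, zero, positive small, and positive big. Vertical axis: degree of membership, from 0.0 ($00) to 1.0 ($FF). Horizontal axis: range of values, marked at $00, $40, $80, $C0, $FF.]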

Motorola, Inc Chapman - ualberta.ca

Fuzzy Controller

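Inputs: Error and Change of Error, each with labels NB, NS, ZE, PS, PB
Output: Output Adjustment, with labels NB, NS, ZE, PS, PB
(NB = negative big, NS = negative small, ZE = zero, PS = positive small, PB = positive big)

Rule table (each entry is the Output Adjustment label):

Change of                Error
Error          NB    NS    ZE    PS    PB
NB             ZE    PS    PB    PB    PB
NS             NS    ZE    PS    PS    PB
ZE             NB    NS    ZE    PS    PB
PS             NB    NS    NS    ZE    PS
PB             NB    NB    NB    NS    ZE

§ error = setting - position
§ change of error = error - previous error
§ output is an adjustment to current output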
Motorola, Inc Chapman - ualberta.ca
Fuzzy Rules
§ IF error is positive big THEN output is much smaller
§ IF error is small and change of error is positive big THEN output is
smaller
§ IF error is small and change of error is positive small THEN output
is a little smaller
§ IF error and change of error are zero THEN output is zero

Motorola, Inc Chapman - ualberta.ca

Fuzzy Controller: Implementation
§ 68HC12 microcontroller (Motorola)
– Fuzzy logic instructions
§ Simple fuzzy system response on 68HC11:
– 750 ms
§ Simple fuzzy system response on 68HC12:
– 50 us (15,000 times faster)
• An enabling technology

Motorola, Inc Chapman - ualberta.ca

Embedded Fuzzy
§ Single chip 68HC12 microcontroller
§ Native fuzzy instructions:
– MEM; evaluate membership functions
– REV; rule evaluation: IF a is x THEN b is y
– WAV; weighted averaging

$$\text{system\_output} = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i}$$

[Figure: trapezoidal membership function. Vertical axis: degree of membership, 0 to $FF. Horizontal axis: input range, 0 to $FF. The trapezoid is defined by point1, point2, slope1, and slope2.]

§ Additional related instructions
– MINA (place smaller of two unsigned 8-bit values in accumulator A)
– EMIND (place smaller of two unsigned 16-bit values in accumulator D)
– MAXM (place larger of two unsigned 8-bit values in memory)
– EMAXM (place larger of two unsigned 16-bit values in memory)
– TBL (table lookup and interpolate)
– ETBL (extended table lookup and interpolate)
– EMACS (extended multiply and accumulate signed 16-bit by 16-bit to 32-bit)
§ Orders of magnitude faster than fuzzy routines
§ Includes: timers, PWM, A/D, Flash, RAM
§ Small and low power

Motorola, Inc Chapman - ualberta.ca

68HC12: Fuzzy Innards
[Figure: 68HC12 fuzzy inference data flow. The knowledge base (program data) holds the input membership functions, the rule list (IF a is x THEN b is y), and the output membership functions. The fuzzy inference kernel (assembler instructions) has three stages: fuzzification (mem) takes system inputs and produces fuzzy inputs (in RAM); rule evaluation (rev) produces fuzzy outputs (in RAM); defuzzification (wav, ediv) produces system outputs.]

$$\text{system\_output} = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i}$$

Motorola, Inc Chapman - ualberta.ca
A Closer Look
§ Each fuzzy instruction uses an efficient memory structure to
maintain information.
– The MEM instruction uses an array of trapezoids for membership functions
and writes to an array of bytes with the calculated membership values for
each input.
– The REV instruction uses a byte array of offsets and flags for the rule
antecedents and consequents and byte arrays for the outputs.
– The REVW instruction uses an array of 16-bit pointers and flags for
antecedent and consequent values and a byte array for weights and
outputs.
– The WAV instruction uses a byte array for outputs and singletons.

Motorola, Inc Chapman - ualberta.ca

Trapezoidal Parameters

[Figure: array of seven membership-function entries, each holding point1, point2, slope1, slope2.]

Motorola, Inc Chapman - ualberta.ca
Fuzzify

fuzzify: ldx   #input_mfs    ; point at membership functions
         ldy   #fuz_ins      ; point at fuzzy input table
         ldaa  current_ins   ; get first input values
         ldab  #7            ; 7 labels per input
loop:    mem                 ; evaluate one membership func.
         dbne  b,loop        ; for 7 labels of one input

[Figure: X points at the seven membership-function entries (point1, point2, slope1, slope2); Y points at the corresponding membership values mv1 to mv7.]

Motorola, Inc Chapman - ualberta.ca
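
What the MEM loop computes can be sketched in portable C. The sketch assumes the usual trapezoid evaluation: zero outside [point1, point2], otherwise the smaller of the two edge ramps, saturated at $FF, with a slope of 0 treated as a vertical edge. The names `trapezoid_t`, `membership`, and `fuzzify` are hypothetical, and the exact corner-case behaviour of the real instruction should be checked against the CPU12 documentation.

```c
#include <stdint.h>

/* One membership function, in the point1/point2/slope1/slope2 layout. */
typedef struct {
    uint8_t point1;  /* left foot of the trapezoid               */
    uint8_t point2;  /* right foot of the trapezoid              */
    uint8_t slope1;  /* rising slope; 0 = vertical edge (assumed)  */
    uint8_t slope2;  /* falling slope; 0 = vertical edge (assumed) */
} trapezoid_t;

/* Degree of membership of input in one trapezoid, 0x00..0xFF. */
uint8_t membership(uint8_t input, const trapezoid_t *t)
{
    if (input < t->point1 || input > t->point2) {
        return 0;                               /* outside the trapezoid */
    }
    unsigned rise = (t->slope1 == 0)
        ? 0xFFu : (unsigned)(input - t->point1) * t->slope1;
    unsigned fall = (t->slope2 == 0)
        ? 0xFFu : (unsigned)(t->point2 - input) * t->slope2;
    unsigned grade = (rise < fall) ? rise : fall;
    return (uint8_t)(grade > 0xFFu ? 0xFFu : grade);  /* saturate at $FF */
}

/* Software counterpart of the loop: one input against `labels` functions. */
void fuzzify(uint8_t input, const trapezoid_t *mfs,
             uint8_t *fuzzy_in, int labels)
{
    for (int i = 0; i < labels; i++) {
        fuzzy_in[i] = membership(input, &mfs[i]);
    }
}
```

With seven labels per input, as on the slide, `fuzzify(current_in, input_mfs, fuz_ins, 7)` fills the seven membership values for one input.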

Rule Evaluation
§ Rule evaluation is the central element of a fuzzy logic inference
program.
– Processes a list of rules from the knowledge base using current fuzzy input
values from RAM to produce a list of fuzzy outputs in RAM.
§ The CPU12 offers two variations of rule evaluation instructions:
– The REV instruction provides for unweighted rules (all rules are considered
to be equally important).
– The REVW instruction is similar but allows each rule to have a separate
weighting factor which is stored in a separate parallel data structure in the
knowledge base.
§ The fuzzy AND operator corresponds to the mathematical
minimum operation and the fuzzy OR operator corresponds to
the mathematical maximum operation.
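
Unweighted min-max rule evaluation can be sketched in C as follows. The rule encoding is an assumption modeled on the $FE/$FF markers shown in the rule-list figure: each rule is a run of antecedent offsets, a $FE separator, a run of consequent offsets, and another $FE, with $FF ending the list. Offsets index one array that holds the fuzzy inputs followed by the fuzzy outputs. The function name `evaluate_rules` is hypothetical.

```c
#include <stdint.h>

#define RULE_SEP 0xFEu  /* separates antecedents from consequents, and rules */
#define RULE_END 0xFFu  /* terminates the rule list                          */

/*
 * Min-max inference over a byte-coded rule list.
 * fuzzy[] holds fuzzy inputs and outputs; outputs must be cleared first.
 */
void evaluate_rules(const uint8_t *rules, uint8_t *fuzzy)
{
    uint8_t truth = 0xFF;        /* running minimum of the antecedents */
    int in_antecedents = 1;

    for (const uint8_t *p = rules; *p != RULE_END; p++) {
        if (*p == RULE_SEP) {
            in_antecedents = !in_antecedents;
            if (in_antecedents) {
                truth = 0xFF;    /* starting a new rule */
            }
        } else if (in_antecedents) {
            if (fuzzy[*p] < truth) {
                truth = fuzzy[*p];        /* fuzzy AND = minimum */
            }
        } else {
            if (fuzzy[*p] < truth) {
                fuzzy[*p] = truth;        /* fuzzy OR = maximum  */
            }
        }
    }
}
```

When several rules drive the same output, the maximum step keeps the strongest rule, which is why the outputs have to be cleared before evaluation begins.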

Motorola, Inc Chapman - ualberta.ca

Evaluate Rules

         ldab  #7            ; loop count
eval:    clr   1,y+          ; clear a fuzzy out and inc pointer
         dbne  b,eval        ; loop to clr all fuzzy outs
         ldx   #rule_start   ; point at first rule element
         ldy   #fuz_ins      ; point at fuzzy ins and outs
         ldaa  #$ff          ; init A (and clears V-bit)
         rev                 ; process rule list

[Figure: registers A, X, and Y. X points at the rule list; Y points at the fuzzy ins (mv1 to mv7) and outs (f1 to f7). Each rule is stored as antecedent entries, $FE, consequent entries, $FE; the last rule ends with $FF instead.]

Motorola, Inc Chapman - ualberta.ca

De-Fuzzify

defuz:   ldy   #fuz_out      ; point at fuzzy outputs
         ldx   #sgltn_pos    ; point at singleton positions
         ldab  #7            ; 7 fuzzy outs per COG output
         wav                 ; calculate sums for wtd av
         ediv                ; final divide for wtd av
         tfr   y,d           ; move result to A:B
         stab  cog_out       ; store system output

[Figure: Y points at the fuzzy outputs f1 to f7; X points at the singletons s1 to s7.]

$$\text{system\_output} = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i}$$

Motorola, Inc Chapman - ualberta.ca
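
The weighted average itself is short enough to write out in C. This sketch assumes 8-bit fuzzy outputs F_i and singleton positions S_i, and returns 0 when every fuzzy output is zero; the function name `defuzzify` and the zero-denominator policy are assumptions, not part of the slides.

```c
#include <stdint.h>

/*
 * Weighted-average (center of gravity) defuzzification:
 *   output = sum(S_i * F_i) / sum(F_i)
 * Returns 0 if all fuzzy outputs are zero (assumed policy).
 */
uint8_t defuzzify(const uint8_t *fuzzy_out, const uint8_t *singletons, int n)
{
    uint32_t weighted_sum = 0;  /* numerator:   sum of S_i * F_i */
    uint32_t weight_sum = 0;    /* denominator: sum of F_i       */

    for (int i = 0; i < n; i++) {
        weighted_sum += (uint32_t)singletons[i] * fuzzy_out[i];
        weight_sum += fuzzy_out[i];
    }
    if (weight_sum == 0) {
        return 0;
    }
    /* A weighted average of 8-bit values always fits in 8 bits. */
    return (uint8_t)(weighted_sum / weight_sum);
}
```

The two accumulations correspond to the sums computed by the wav step, and the division corresponds to the ediv step.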

Program Measurements
§ Assume: 2 inputs, 1 output, 25 non-conjunctive rules and 7
membership functions on ins and out.
– Data structure costs: 160 bytes
– Program structure costs: <100 bytes
– Execution time: 75 µs (response time), or a maximum cycle rate of about 13 kHz

Motorola, Inc Chapman - ualberta.ca

Example: Multiple Fuzzy Controllers for a Quadruped Robot

[Figure: controller hierarchy. A balance controller, with position, tilt sensor, and energy inputs, sits above four limb controllers; each limb controller sits above two joint controllers, each with its own sensor and actuator.]

§ hierarchy of 13 fuzzy controllers

§ standing level on uneven surfaces
§ maintaining balance dynamically

Motorola, Inc Chapman - ualberta.ca

Reconfigurable Architectures: Tensilica LX


Taxonomy of Reconfigurable Architectures

RECONFIGURABLE ARCHITECTURES (R-SOC)
§ FINE GRAIN (FPGA)
– Island Topology
• Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
– Hierarchical Topology
• Altera Stratix, Altera Apex, Altera Cyclone
§ MULTI GRANULARITY (Heterogeneous)
– Processor + Coprocessor
• Coarse Grain Coprocessor: Chameleon, REMARC, Morphosys
• Fine Grain Coprocessor: Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC, Tensilica
– Tile-Based Architecture
• aSoC, E-FPFA
§ COARSE GRAIN (Systolic)
– Mesh Topology
• RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp
– Linear Topology
• Systolic Ring, RaPiD, PipeRench
– Hierarchical Topology
• DART, FPFA

Xtensa LX – Basic Architecture
§ Processor Configuration
– Energy Usage: 76 µW/MHz, 47 µW/MHz (5 and 7 stage pipeline)
– Clock Speed: 350 MHz, 400 MHz (5 and 7 stage pipeline)
– Cache:
• up to 32 KB and 1,2,3,4 way set associative cache
– 64 general purpose physical registers (32-bits)
– 6 special purpose registers
– Extensible via use of TIE and FLIX instructions
– Zero-overhead loops

Courtesy Tensilica
Xtensa LX Architecture
§ 32-bit ALU
§ 1 or 2 Load/Store Model
§ Registers
– 32-bit general purpose register file
– 32-bit program counter
– 16 optional 1-bit Boolean registers
– 16 optional 32-bit floating point registers
– 4 optional 32-bit MAC16 data registers
– Optional Vectra LX DSP registers
§ General Purpose AR Register File
– 32 or 64 registers
– Instructions have access through “sliding window” of 16 registers. Window
can rotate by 4, 8, or 12 registers
– Register window reduces code size by limiting the number of bits for the address
and eliminates the need to save and restore register files

Courtesy Tensilica
Xtensa LX Architecture

Courtesy Tensilica
Xtensa LX Pipelining

§ 5 or 7 Stage Pipeline Design:
– 5 stage pipeline has stages: IF, Register Access, Execute, Data-Memory Access, and Register Writeback
– 5 stage pipeline accesses memory in two stages
– 7 stage pipeline is an extended version of the 5 stage pipeline with an extra IF stage and an extra Memory Access stage
– Extra stages provide more time for memory access, so the designer can run at a higher clock speed while using slower memory to improve performance

Courtesy Tensilica
Xtensa LX Instruction Set

§ The Xtensa ISA consists of 80 core instructions, including both 16- and 24-bit instructions

Courtesy Tensilica
Xtensa LX ISA – Building Blocks
§ Floating Point Unit
– 32-bit, single precision, floating-point coprocessor
§ Vectra LX DSP Engine
– Optimized to handle Digital Signal Processing Applications

Courtesy Tensilica
Vectra LX DSP Engine
§ FLIX (Flexible length instruction extension) Based
§ Vectra LX instructions encoded in 64 bits.
– Bits 0:3 of an Xtensa instruction determine its length and format; a value
of 14 specifies a Vectra LX instruction
– Bits 4:27 – contain either Xtensa LX core instruction or Vectra LX Load or
Store instruction
– Bits 28:45 – contains either a MAC instruction or a select instruction
– Bits 46:63 – contains either ALU and shift instructions or a load and store
instruction for the second Vectra LX load/store unit
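
The field layout above can be made concrete with a small C decoder. It assumes that bit 0 is the least significant bit of the 64-bit word, which the slide does not state; the names `vectra_slots_t`, `slot0` to `slot2`, and `vectra_decode` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VECTRA_FORMAT 14u   /* value of bits 0:3 for a Vectra LX instruction */

/* The three operation fields of one 64-bit instruction word. */
typedef struct {
    uint32_t slot0;  /* bits 4:27  (24 bits): core or Vectra load/store op */
    uint32_t slot1;  /* bits 28:45 (18 bits): MAC or select op             */
    uint32_t slot2;  /* bits 46:63 (18 bits): ALU/shift or 2nd load/store  */
} vectra_slots_t;

/* Extract bits lo..hi (inclusive) of word; bit 0 is assumed to be the LSB. */
static uint64_t extract_bits(uint64_t word, unsigned lo, unsigned hi)
{
    unsigned width = hi - lo + 1;       /* at most 24 in this sketch */
    return (word >> lo) & ((UINT64_C(1) << width) - 1);
}

/* Returns false if the format field does not mark a Vectra LX instruction. */
bool vectra_decode(uint64_t word, vectra_slots_t *out)
{
    if (extract_bits(word, 0, 3) != VECTRA_FORMAT) {
        return false;
    }
    out->slot0 = (uint32_t)extract_bits(word, 4, 27);
    out->slot1 = (uint32_t)extract_bits(word, 28, 45);
    out->slot2 = (uint32_t)extract_bits(word, 46, 63);
    return true;
}
```

The field widths add up as expected: 4 + 24 + 18 + 18 = 64 bits.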

Courtesy Tensilica
TIE (Tensilica Instruction Extension)
§ Method used to extend the processor’s architecture and
instruction set using the TIE compiler

§ TIE Compiler
– Generates file used to configure software development tools so that they
recognize TIE Extensions
– Estimates hardware size of new instruction
– You can modify application code to take advantage of the new instruction
and simulate to decide if the speed advantage is worth the hardware cost

§ TIE Syntax
– Resembles Verilog
– More concise than RTL (it omits all sequential logic, pipeline registers, and
initialization sequences)
– The custom instructions and registers described in TIE are part of the
processor’s programming model.
Courtesy Tensilica
TIE Queues and Ports
§ New way to communicate with external devices
§ Queues: data can be sent or read through queues. A queue is
defined in the TIE and the compiler generates the interface
signals required for the additional port needed to connect to the
queue. Logic is also automatically generated
§ Import-wire: processor can sample the value of an external signal
§ Export-state: drive an output based on TIE

Courtesy Tensilica
TIE
§ TIE Combines multiple operations into one using:
– Fusion
– SIMD/Vector Transformation
– FLIX

Courtesy Tensilica
Fusion
§ Allows you to combine dependent operations into a single
instruction

Consider: computing the average of two arrays

unsigned short *a, *b, *c;
. . .
for (i = 0; i < n; i++)
    c[i] = (a[i] + b[i]) >> 1;

Two Xtensa LX Core instructions required, in addition to load/store instructions

Courtesy Tensilica
Fusion
Fuse the two operations into a single TIE instruction

operation AVERAGE {out AR res, in AR input0, in AR input1} {} {
    wire [16:0] tmp = input0[15:0] + input1[15:0];
    assign res = tmp[16:1];
}

Essentially an add feeding a shift, described using standard Verilog-like syntax

Implementing the instruction in C/C++

#include <xtensa/tie/average.h>
unsigned short *a, *b, *c;
. . .
for (i = 0; i < n; i++)
    c[i] = AVERAGE(a[i], b[i]);
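
For checking results on a host machine, the fused operation can be modeled in plain C. This is a reference model of the arithmetic only (a 17-bit sum followed by a one-bit right shift), not Tensilica-generated code; the names `average_u16` and `average_arrays` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of the fused add-and-shift: (x + y) >> 1 without overflow. */
static inline uint16_t average_u16(uint16_t x, uint16_t y)
{
    uint32_t tmp = (uint32_t)x + y;   /* sum needs 17 bits, like tmp[16:0] */
    return (uint16_t)(tmp >> 1);      /* keep bits 16:1                    */
}

/* Element-wise average of two arrays, one fused operation per element. */
void average_arrays(const uint16_t *a, const uint16_t *b,
                    uint16_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = average_u16(a[i], b[i]);
    }
}
```

Keeping the carry bit before shifting is the detail that matters: averaging 0xFFFF with 0xFFFF must give 0xFFFF, which a 16-bit intermediate sum would not.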

Courtesy Tensilica
SIMD/Vector Transformation
§ Single Instruction, Multiple Data
– Fusing instructions into a “vector”
– Allows replication of the same operation multiple times in one instruction

Consider: Computing four averages in one instruction

The following TIE code computes multiple iterations in a single instruction by
combining Fusion and SIMD

regfile VEC 64 8 v

operation VAVERAGE {out VEC res, in VEC input0, in VEC input1} {} {
    wire [67:0] tmp = {input0[63:48] + input1[63:48],
                       input0[47:32] + input1[47:32],
                       input0[31:16] + input1[31:16],
                       input0[15:0]  + input1[15:0]};
    assign res = {tmp[67:52], tmp[50:35], tmp[33:18], tmp[16:1]};
}

Courtesy Tensilica
SIMD/Vector Transformation
§ Computing four 16-bit averages
– Each data vector must be 64 bits (4 x 16 bits)
§ Create new register file, new instruction
– VEC - eight 64-bit registers to hold data vectors
– VAVERAGE - takes operands from VEC, computes average, saves results into
VEC

VEC *a, *b, *c;

for (i = 0; i < n / 4; i++) {
    c[i] = VAVERAGE(a[i], b[i]);
}

§ New data type recognized

– TIE automatically creates new load, store instructions to move 64-bit vectors
between VEC register file and memory
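
The four-lane behaviour can be modeled on a host with an ordinary 64-bit integer standing in for one VEC register. This is a sketch of the intended arithmetic (each 16-bit lane averaged independently with its carry kept), not of the generated hardware; the function name `vaverage` is hypothetical.

```c
#include <stdint.h>

/* Model of a 4 x 16-bit SIMD average on one 64-bit vector value. */
uint64_t vaverage(uint64_t x, uint64_t y)
{
    uint64_t res = 0;

    for (int lane = 0; lane < 4; lane++) {
        unsigned shift = 16u * (unsigned)lane;
        uint32_t a = (uint32_t)((x >> shift) & 0xFFFFu);  /* lane of x */
        uint32_t b = (uint32_t)((y >> shift) & 0xFFFFu);  /* lane of y */
        uint64_t avg = (a + b) >> 1;      /* 17-bit sum, then shift    */
        res |= avg << shift;              /* place result in its lane  */
    }
    return res;
}
```

Each call processes four array elements, so an array of n elements needs n / 4 calls.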

Courtesy Tensilica
FLIX
§ Flexible Length Instruction Xtensions (FLIX)
– Key to extreme extensibility
– Huge performance gains possible
– Code size reduction without code bloat
§ Similar to VLIW
§ Created by XPRES Compiler

Courtesy Tensilica
FLIX - Usage
§ Used selectively when parallelism is needed
§ Avoids code bloat
§ Used seamlessly with standard 16- and 24-bit instructions

Courtesy Tensilica
XPRES Compiler
§ Powerful synthesis tool
– Creates tailored processor descriptions
– Run on native C/C++ code
§ Three optimization methods
§ Returns optimal configurations along with pros and cons (tradeoffs)
§ Analyzes C/C++ code
§ Generates possible configurations
§ Compares performance criteria to silicon size (cost)
§ Returns possible configurations

Courtesy Tensilica
