EE382N-4 Advanced Microcontroller Systems: Accelerators and Co-Processors

This document summarizes a class on accelerators and co-processors. It discusses the differences between accelerators and co-processors, with accelerators appearing as devices on a bus controlled by registers, while co-processors execute instructions dispatched by the CPU. Examples are given of tightly and loosely coupled co-processors. Common applications of hardware acceleration are also outlined, such as for graphics, audio/video processing, and encryption. Decision trees are presented for when and how to implement a hardware accelerator. Finally, different programming models for hybrid CPU-accelerator systems are described.

EE382N-4

Advanced Microcontroller Systems

Accelerators and Co-Processors

Mark McDermott

Spring 2018

EE382N-4 Class Notes


Agenda
§ Taxonomy of Hardware Acceleration
§ Co-Processors
– Tightly coupled
– Loosely coupled
• FP Matrix Multiplier
• MC68332 Time Processing Unit
§ ISA Enhancements:
– HC12 Fuzzy Logic Acceleration
§ Reconfigurable architectures
– Tensilica

2/22/18 EE382N-4 Class Notes 2


Taxonomy of Hardware Acceleration



Accelerator vs. co-processor
§ A co-processor executes instructions.
– Instructions are dispatched by the CPU.

§ An accelerator appears as a device on the bus.
– The accelerator is controlled by registers.



Hardware Acceleration
§ Ad hoc interface to controlling processor
– Usually memory-mapped
– Bus-based, FIFO, or register data interfaces

§ Typically, the processor transfers data to the accelerator, issues a
go command, and then collects result data later.
– Polled or interrupt-based interface

§ Accelerator may have its own path to/from memory

§ Often fixed function but can be microcoded for programmability
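
The transfer / go / collect pattern described above can be sketched in C. This is only an illustration under stated assumptions: the register layout (`accel_regs_t`), the `ACCEL_GO` and `ACCEL_DONE` bits, and the squaring "accelerator" in `accel_model_step` are all hypothetical, and the hardware is replaced by a software model so the code runs on a host. On a real system the structure pointer would refer to the accelerator's memory-mapped base address.

```c
#include <stdint.h>

/* Hypothetical register map; a real accelerator defines its own layout. */
typedef struct {
    volatile uint32_t data_in;   /* operand written by the CPU   */
    volatile uint32_t control;   /* bit 0 = GO                   */
    volatile uint32_t status;    /* bit 0 = DONE                 */
    volatile uint32_t data_out;  /* result read back by the CPU  */
} accel_regs_t;

#define ACCEL_GO   0x1u
#define ACCEL_DONE 0x1u

/* Software stand-in for the hardware: squares the input when GO is set. */
static void accel_model_step(accel_regs_t *r)
{
    if (r->control & ACCEL_GO) {
        r->data_out = r->data_in * r->data_in;
        r->control &= ~ACCEL_GO;
        r->status |= ACCEL_DONE;
    }
}

/* Transfer data, issue GO, poll for DONE, then collect the result. */
static uint32_t accel_run(accel_regs_t *r, uint32_t x)
{
    r->status = 0;
    r->data_in = x;
    r->control = ACCEL_GO;
    while (!(r->status & ACCEL_DONE)) {
        accel_model_step(r);     /* on real hardware this loop only polls */
    }
    return r->data_out;
}
```

Calling `accel_run(&regs, 7)` on a zero-initialized `accel_regs_t` returns 49 with this model. An interrupt-based variant would replace the `while` loop with a wait on a completion flag set by an interrupt handler.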



Common HW Acceleration Applications
§ Graphics
§ Data Compression
§ Audio/Video Encoding/Decoding
§ Image sensing and processing
§ Data Encryption: RSA, DES, AES
§ Router frame queuing, port selection



Common HW Acceleration Applications
§ Embedded Systems
– FPGAs appearing in set-top boxes, routers, audio equipment, etc.
• Advantages
– Performance close to ASIC, sometimes at much lower cost
Ø Many other embedded systems still use ASIC due to high volume
> Cell phones, iPod, game consoles, etc.
– Reconfigurable!
Ø If standards change, architecture is not fixed
Ø Can add new features after production



Common HW Acceleration Applications
§ High-performance embedded computing (HPEC)
– High-performance/super computing with special needs (low power, low
size/weight, etc.)
• Satellite image processing
• Target recognition in a UAV
– Advantages
• Much smaller/lower power than a supercomputer
• Fault tolerance



Common HW Acceleration Applications
§ High-performance computing (HPC)
– Cray, SGI, DRC, GiDEL, Nallatech, XtremeData
• Combine high-performance microprocessors with FPGA accelerators
– Novo-G
• 192 Altera Stratix III FPGAs integrated with 24 quad-core microprocessors
§ Advantages
– HPC used for many scientific apps
• Low volume, ASIC rarely feasible, microprocessor too slow
– Lower power consumption
• Increasingly important
• Cooling and energy costs are dominant factor in total cost of ownership



Common HW Acceleration Applications
§ General-purpose computing
– Ideal situation: desktop machine/OS uses a programmable accelerator to
speed up all applications (similar to GPU trend)
– Problems
• The accelerator can be very fast, but not for all applications
– Generally requires parallel algorithms
• Coding constructs used in many applications not appropriate for hardware
– Subject of tremendous amount of past and likely future research
§ How to use extra transistors on general purpose CPUs?
– More cache
– More microprocessor cores
– GPU
– FPGA?
– Something else?



Decision Tree: When do you use a hardware accelerator?

Easy

Can an existing algorithm be implemented using existing ISA?

Can a new algorithm be devised to solve problem using existing ISA?

Can API be modified to expose necessary functionality or make it easier to exploit?

Can HW accelerator be added as a co-processor instruction?

Can ISA be modified to better support algorithm?

Can datapath be modified to better support algorithm, without breaking others?


Hard



Four Programmer's Models of Accelerator Design

[Figure: four panels, each showing a CPU and an accelerator with the software layers between the application and the hardware]
§ Base: HW I/F only (application talks to the accelerator directly)
§ No OS service (in simple embedded systems)
§ OS service: accelerator accessed as a user space memory mapped I/O device (mmap())
§ Virtualized device with OS scheduling support (dev() driver)

Perkowski, psu.edu



Hybrid Hardware/Software Execution Model
§ Hardware Accelerator as a Kernel Module
– Seamless integration of hardware accelerators into the Linux software
stack for use by mainstream applications
– The KM approach enables transparent interchange of software and
hardware components

§ Application level execution model
– Compiler deep analysis and transformations generate CPU code, hardware
library stubs and synthesized components
– FPGA bitmaps as hardware counterpart to existing software modules.
– Same dynamic linking library interfaces and stubs apply to both
software and hardware implementation

§ OS resource management
– Services (API) for allocation, partial reconfiguration, saving and
restoring the status, and monitoring
– Multiprogramming scheduler can pre-fetch hardware accelerators in time
for next use
– Control the access to the new hardware to ensure trust under private or
shared use

[Figure: compile time (source code, compiler analysis/transformations, human designed hardware, synthesis), user runtime (application, DLL, linker/loader, resource manager), and kernel runtime (Linux OS modules) layered over memory, CPU, FPGA accelerators, and devices. A user level function or device driver is either a soft object or a hard object.]

Perkowski, psu.edu



CPU-Accelerator Interface
§ Accelerator exposes control registers that the CPU reads and writes
§ Data registers can be used for small data objects
§ Accelerator may include special-purpose read/write logic (DMA
hardware)
– Especially valuable for large data transfers



CPU-Accelerator Interface Example

[Figure: block diagram with numbered paths 1 to 6 between the ARM core, the AXI slave interconnect (I/C), the DDR controller and DDR on the PS side, and the block RAM and accelerator on the PL side. The path numbers match the table rows below.]

§ AXI
– 32 bit bus
– Access to DRAM data & programmable logic fabric
– 1/2 CPU frequency
– Big penalty if bus is busy during first attempt to access bus

§ AHB (AMBA Advanced High-performance Bus)
– 64 bit bus
– Runs at CPU clock frequency
– Access to DDR Controller to provide addresses to SDRAM

Path               First access      Pipelined bus access   Arbitration
                   Read    Write     Read    Write
1 ARM → I/C          2       2         2       2
2 I/C → AXI          8       8         3       3                5
3 AHB → DDRC         4       4         4       4
4 DDRC → DRAM        8       9         3       3                5
5 AXI ↔ BRAM        20      20         8       8               12
6 BRAM ↔ ACC         2       2         2       2

Perkowski, psu.edu
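
One way to read the latency table is as a burst cost model. The sketch below assumes, and this is an assumption not stated on the slide, that the first word of a transfer pays the first-access figure and each following word pays the pipelined figure, in whatever time unit the table uses (presumably clock cycles). `path_cost_t` and `burst_cost` are hypothetical names.

```c
#include <stdint.h>

/* Latency pair copied from one table row. */
typedef struct {
    uint32_t first_access;  /* cost of the first word             */
    uint32_t pipelined;     /* cost of each following word        */
} path_cost_t;

/* Assumed model: first word pays first_access, later words pay pipelined. */
static uint32_t burst_cost(path_cost_t p, uint32_t words)
{
    if (words == 0) {
        return 0;
    }
    return p.first_access + (words - 1u) * p.pipelined;
}
```

With the AXI ↔ BRAM read figures (20 and 8), a 16-word burst costs 20 + 15 × 8 = 140 units under this model, against 16 × 20 = 320 if every word paid the first-access cost. Arbitration penalties are not included.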



Hardware Accelerator Interface: Interrupts or Polling?
§ Polling interfaces usually require the processor to read a
memory-mapped register to determine the state of the
accelerator.
– Can the accelerator accept new input data?
– Is the accelerator done with its current task?
– Has the accelerator generated an error condition?

§ Polling interfaces offer minimal latency between the setting of a
condition on the accelerator and its discovery by the controlling
processor.
– But processor isn’t doing useful work while it polls…



Hardware Accelerator Interface: Interrupts or Polling?
§ Interrupt-based interfaces allow the accelerator to signal
conditions to the controlling processor.
– Interrupt latency is longer than is achievable via the polling method.
– But the processor can more easily proceed with other work while the
accelerator is busy with a task.

§ Interrupts are more efficient for coarse-grained parallelism (i.e.,
larger tasks with looser and less frequent synchronization
requirements)

§ Interrupts may not work for real-time control tasks with tight
schedules
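
The trade-off between the two interfaces can be illustrated with a toy model. Everything here is hypothetical: `toy_accel_t` is a software stand-in that finishes after a fixed number of ticks, and the "interrupt" is a callback invoked by the model, so real interrupt latency is not represented. The sketch only shows where the waiting time goes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy accelerator that finishes a task after a fixed number of ticks. */
typedef struct {
    uint32_t ticks_left;
    bool     done;
    void   (*on_done)(void *ctx);  /* stands in for an interrupt line */
    void    *ctx;
} toy_accel_t;

static void toy_accel_tick(toy_accel_t *a)
{
    if (a->done) return;
    if (a->ticks_left > 0) a->ticks_left--;
    if (a->ticks_left == 0) {
        a->done = true;
        if (a->on_done) a->on_done(a->ctx);  /* "raise the interrupt" */
    }
}

/* Polling: the CPU only checks status; returns the ticks spent waiting. */
static uint32_t run_polled(toy_accel_t *a)
{
    uint32_t wasted = 0;
    while (!a->done) {
        toy_accel_tick(a);
        wasted++;
    }
    return wasted;
}

/* Handler: records completion in a flag owned by the waiting code. */
static void isr_set_flag(void *ctx) { *(volatile bool *)ctx = true; }

/* Interrupt style: one unit of other work per tick until the handler fires. */
static uint32_t run_interrupt(toy_accel_t *a)
{
    volatile bool finished = false;
    uint32_t useful_work = 0;
    a->on_done = isr_set_flag;
    a->ctx = (void *)&finished;
    while (!finished) {
        useful_work++;           /* placeholder for real background work */
        toy_accel_tick(a);
    }
    return useful_work;
}
```

For a 5-tick task, `run_polled` reports 5 ticks spent waiting while `run_interrupt` reports 5 units of other work completed in the same time.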



Typical CPU → Accelerator Transaction

[Figure: timeline (time →) of one transaction across the application, the operating system, and the hardware (ARM, memory, AXI, DMA controller, accelerator)]

Step  Application                               Operating system
1     open(/dev/accel); /* only once */         Enable accelerator access for application
2     /* construct macroblocks */
      macroblock = …
3     syscall(&macroblock, num_blocks)          Data copy
4                                               Flush cache range
5                                               Set up DMA transfer
6                                               Poll (accelerator executing)
7                                               Set up DMA transfer
8     …                                         Invalidate cache range
9                                               Data copy
10    /* macroblock now has transformed data */
Perkowski, psu.edu
Caching Issues with Accelerators
§ Main memory provides the primary data transfer mechanism to
the accelerator.
§ Programs must ensure that caching does not invalidate main
memory data.
– CPU reads location S.
– Accelerator writes location S.
– CPU writes location S.
• BAD – the CPU's cached copy of S is stale, so the program never sees
the value the accelerator wrote

§ The bus interface may provide mechanisms for accelerators to
tell the CPU of required cache changes…
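
The hazard can be reproduced with a deliberately tiny model: one memory word S and a one-entry CPU cache. All names (`toy_system_t`, `cpu_read_S`, `accel_write_S`, `cpu_invalidate_S`) are hypothetical, and the model ignores write-back behavior. It only shows why a cached read after an accelerator write returns stale data unless the line is invalidated.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-entry cache in front of a single memory word S (toy model). */
typedef struct {
    uint32_t mem_S;        /* main memory copy of S   */
    uint32_t cache_S;      /* CPU's cached copy       */
    bool     cache_valid;
} toy_system_t;

static uint32_t cpu_read_S(toy_system_t *s)
{
    if (!s->cache_valid) {           /* miss: fill from main memory */
        s->cache_S = s->mem_S;
        s->cache_valid = true;
    }
    return s->cache_S;               /* hit: main memory is not consulted */
}

/* The accelerator writes main memory directly, bypassing the CPU cache. */
static void accel_write_S(toy_system_t *s, uint32_t v) { s->mem_S = v; }

/* What software (or a coherent bus) must do before reading accelerator output. */
static void cpu_invalidate_S(toy_system_t *s) { s->cache_valid = false; }
```

Reading S, letting the accelerator write a new value, and reading again returns the old value. After `cpu_invalidate_S`, the next read refills from memory and returns the new one.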



Synchronization and Memory
§ As with cache, main memory writes to shared memory may cause
invalidation (memory incoherence).
– CPU reads location S
– Accelerator writes S
– CPU writes S

§ Many CPU buses implement test-and-set atomic operations that
the accelerator can use to implement a semaphore. This can
serve as a highly efficient means of synchronizing inter-process
communications (IPC)
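
A semaphore built on test-and-set can be sketched with C11 atomics. This is an analogy, not the bus mechanism itself: `atomic_flag_test_and_set` stands in for a bus-level atomic test-and-set, and `tas_sem_t` and its functions are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Binary semaphore guarding a shared buffer. */
typedef struct { atomic_flag locked; } tas_sem_t;

static void tas_sem_init(tas_sem_t *s) { atomic_flag_clear(&s->locked); }

/* Returns true if the caller obtained the semaphore. */
static bool tas_sem_try_acquire(tas_sem_t *s)
{
    /* test-and-set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire);
}

/* Spin until the semaphore is obtained. */
static void tas_sem_acquire(tas_sem_t *s)
{
    while (!tas_sem_try_acquire(s)) { /* spin */ }
}

static void tas_sem_release(tas_sem_t *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Either side (CPU or accelerator) would acquire the semaphore before touching the shared location and release it afterwards. The atomicity of the test-and-set guarantees that only one of them sees the flag as free.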



Software Versus Hardware Acceleration

Overhead is a
major issue!
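
A common back-of-the-envelope model makes the overhead point concrete. It is not taken from the slide: $t_{\mathrm{sw}}$ is the software-only execution time, $t_{\mathrm{hw}}$ the accelerator's compute time, and $t_{\mathrm{ovh}}$ the combined overhead of the driver call, data copies, cache maintenance, and synchronization.

```latex
S = \frac{t_{\mathrm{sw}}}{t_{\mathrm{hw}} + t_{\mathrm{ovh}}},
\qquad
S > 1 \iff t_{\mathrm{ovh}} < t_{\mathrm{sw}} - t_{\mathrm{hw}}
```

For example, with $t_{\mathrm{sw}} = 100\,\mu s$, $t_{\mathrm{hw}} = 10\,\mu s$, and $t_{\mathrm{ovh}} = 60\,\mu s$, the speedup is $100/70 \approx 1.4$ rather than the factor of 10 suggested by the raw compute times.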

Perkowski, psu.edu
Device Driver Access Cost

Perkowski, psu.edu



Co-Processors



Tightly Coupled Coprocessors
§ Integrated with processor control logic
– Task typically completes in a few cycles
– Small amounts of data
– Processor stalls waiting for the coprocessor
– Communication with coprocessor typically via registers and dedicated
control signals
– Coprocessor ports
• Examples:
– ARM (ARM7TDMI);
– Texas Instruments TMS320C55x processors



Tightly-Coupled Coprocessor Example

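[Figure: TMS320C55x block diagram showing the memory system, instruction decode, register file, the TCC interface (TCC I/F), and the tightly coupled coprocessor (TCC) that executes TCC instructions]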



Loosely-Coupled Coprocessors
§ Loosely-Coupled Coprocessors
– Used for larger tasks than is the case for
tightly-coupled coprocessors
– Task runs in parallel with main processor
– May take many cycles per task
– Large amounts of data that coprocessor may access independent of main
processor
– May or may not use the standard coprocessor interface



Loosely-Coupled FP Matrix Multiplier Coprocessor

https://www.xilinx.com/support/documentation/application_notes/xapp1170-zynq-hls.pdf



Accelerator Coherency Port (ACP)
§ Accelerator coherency port (ACP) is a 64-bit AXI slave interface
on the SCU that provides an asynchronous cache-coherent access
point directly from the PL to the Cortex-A9 MPCore processor
subsystem.
§ A range of system PL masters can use this interface to access the
caches and the memory subsystem exactly the way the APU
processors do to simplify software, increase overall system
performance, or improve power consumption.
– Interface acts as a standard AXI slave and supports all standard read and
write transactions without any additional coherency requirements placed on
the PL components. Therefore, the ACP provides cache-coherent access from
the PL to ARM caches, while any memory local to the PL is non-coherent
with the ARM.
http://www.xilinx.com/support/answer-navigation/answer-keyword-search.html?type=answerRecord&analytics=AnswersDatabase&searchKeywords=ZYNQ+ACP

https://www.xilinx.com/support/documentation/sw_manuals/ug1046-ultrafast-design-methodology-guide.pdf



ACP Usage
§ The ACP provides a low latency path between the PS and the
accelerators implemented in the PL when compared with a legacy
cache flushing and loading scheme. Steps that must take place in
an example of a PL-based accelerator are as follows:
– The CPU prepares input data for the accelerator within its local cache space.
– The CPU sends a message to the accelerator using one of the general
purpose AXI master interfaces to the PL.
– The accelerator fetches the data through the ACP, processes the data, and
returns the result through the ACP.
– The accelerator sets a flag by writing to a known location to indicate that the
data processing is complete. Status of this flag can be polled by the
processor



ACP Caveats
§ NOTE: When compared to a tightly-coupled coprocessor, ACP
access latencies are relatively long. Therefore, ACP is not
recommended for fine-grained instruction level acceleration.
§ For coarse-grain acceleration such as video frame-level
processing, ACP does not have a clear advantage over traditional
memory-mapped PL acceleration because the transaction
overhead is small relative to the transaction time, and might
potentially cause undesirable cache thrashing.
§ ACP is therefore optimal for medium-grain acceleration, such as
block-level crypto accelerator and video macro-block level
processing.



Micro-coded Co-Processor:
MC68332 Time Processing Unit



MC68332 Time Processing Unit
§ The TPU3 can be viewed as a special-purpose microcomputer that performs a
programmable series of two operations, match and capture.
§ The microengine uses microcode to perform functions.

[Figure: TPU block diagram. A host interface on the Inter-Module Bus (IMB) holds system configuration, channel control, development support and test, and parameter RAM. The scheduler receives service requests from the timer channels. The microengine contains the control store and the execution unit. Timer channels 0 through 15 share the time bases (TCR1, TCR2) and connect to the pins.]

Motorola, Inc. Kurt Keutzer UCB


Time Processing Unit
TPU Preprogrammed Functions:
§ Semi-autonomous microcontroller
§ Operates concurrently with CPU
• Schedules tasks
• Processes ROM instructions
• Accesses shared data with CPU
• Performs Input/Output operations
§ Programmable series of 2 operations
• Match
• Capture
§ Each operation is called an "event"
§ A pre-programmed series of events is
called a "function"

Motorola, Inc. Kurt Keutzer UCB


Time Bases

§ Two sixteen-bit counters provide
time bases for all channels
§ Pre-scalers controlled by CPU via
bit-fields in TPU module
configuration register TPUMCR
§ Current values accessible via TCR1
and TCR2 registers
§ TCR1, TCR2 can be read/written
by TPU microcode; not available
to CPU
§ TCR1 qualified by system clock
§ TCR2 qualified by system clock or
external clock
Motorola, Inc. Kurt Keutzer UCB
Timer Channels

§ Sixteen channels
– each one connected to an MCU pin
§ Each channel has symmetric
hardware:
§ Event register
– 16-bit capture register
– 16-bit compare/match register
– 16-bit comparator
§ Pin control logic - pin direction
determined by TPU microengine
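
The per-channel hardware listed above (capture register, match register, comparator) can be modeled in a few lines of C. This is a simplified sketch under stated assumptions: `tpu_channel_t` and `tpu_channel_step` are hypothetical names, the comparator is modeled as a plain equality test, and pin direction and the choice between TCR1 and TCR2 are left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t match_reg;      /* compare/match register           */
    uint16_t capture_reg;    /* capture register                 */
    bool     match_event;    /* set when the comparator fires    */
    bool     capture_event;  /* set when a pin edge is captured  */
} tpu_channel_t;

/* Evaluate one channel for one time-base count.
 * tcr      : current 16-bit time-base value
 * pin_edge : true if the channel pin saw the selected transition */
static void tpu_channel_step(tpu_channel_t *ch, uint16_t tcr, bool pin_edge)
{
    if (tcr == ch->match_reg) {      /* simplified equality comparator */
        ch->match_event = true;
    }
    if (pin_edge) {
        ch->capture_reg = tcr;       /* latch the time of the edge */
        ch->capture_event = true;
    }
}
```

A match event lets microcode act at a programmed time (output timing), while a capture event records when an input edge arrived. Either event results in a service request to the scheduler.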

Motorola, Inc. Kurt Keutzer UCB


Scheduler

§ Determines which of sixteen
channels is serviced by the
microengine
§ Channel can request service
for one of four reasons
– host service
– link to another channel
– match event
– capture event
§ Host system assigns to each
channel a priority
– high
– middle
– low
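
Channel selection by priority can be sketched as below. This is a deliberately simplified policy, not the TPU's actual algorithm: it picks the highest-priority requesting channel and breaks ties by lowest channel number, and it does not model any mechanism that keeps lower priorities from being starved. The names `tpu_prio_t`, `PRIO_DISABLED`, and `tpu_pick_channel` are hypothetical.

```c
#include <stdint.h>

enum { TPU_CHANNELS = 16 };

/* Host-assigned channel priority; PRIO_DISABLED is an assumed extra level. */
typedef enum { PRIO_DISABLED = 0, PRIO_LOW, PRIO_MIDDLE, PRIO_HIGH } tpu_prio_t;

/* Pick the next channel to service, or -1 if none is requesting.
 * pending : bit n set means channel n has a service request
 * prio    : host-assigned priority per channel */
static int tpu_pick_channel(uint16_t pending, const tpu_prio_t prio[TPU_CHANNELS])
{
    int best = -1;
    for (int ch = 0; ch < TPU_CHANNELS; ch++) {
        if (!(pending & (1u << ch)) || prio[ch] == PRIO_DISABLED) {
            continue;
        }
        if (best < 0 || prio[ch] > prio[best]) {
            best = ch;               /* ties keep the lowest channel number */
        }
    }
    return best;
}
```

The selected channel is handed to the microengine, which runs that channel's microcoded function and then returns control to the scheduler.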

Motorola, Inc. Kurt Keutzer UCB


Microengine

§ Executes microcoded
functions for selected
channel.
§ Returns control to
scheduler when
completed.

Motorola, Inc. Kurt Keutzer UCB


WCS – Writeable Control Store

§ The microcode in the TPU is
hard coded into a mask-programmable
ROM.
§ To facilitate microcode
development and debug, a
block RAM can be used to
replace the ROM providing a
“Writeable Control Store”
capability.

68332 Die Photo



ISA Enhancements: Fuzzy Logic Acceleration



Overview
§ Fuzzy controllers provide a unique way of controlling complex
systems.
– If written in software they can take a long time to execute, limiting their
application in real-time systems.
– Using dedicated logic and a few new assembler instructions, a
microcontroller can be enhanced to execute a fuzzy controller quite
efficiently.

Chapman - ualberta.ca
Fuzzy Sets
§ In the real world, very few things belong to a single classification
and often the boundaries are not clear
§ Fuzzy sets are an extension of regular (Aristotelian) sets:
– sets can overlap
– members of a set have a degree of membership instead of simply belonging
or not belonging
§ Fuzzy sets can be given linguistic labels such as warm, positive
big, very cold, etc.
§ Fuzzy logic allows the sets to be computed
– boundary conditions same as two-level logic

Chapman - ualberta.ca
Input Memberships
§ The degree to which an input belongs to a classification is called
its membership value
§ The classifications can overlap so that an input value can belong
to more than one, hence the ability to fuzzify the input

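[Figure: five overlapping membership functions (negative big, negative small, zero, positive small, positive big) plotted over the range of values $00 to $FF, with ticks at $00, $40, $80, $C0, and $FF. The vertical axis is the degree of membership, from 0.0 ($00) to 1.0 ($FF).]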

Motorola, Inc Chapman - ualberta.ca


Fuzzy Controller

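[Figure: the two inputs (error, change of error) and the output adjustment each use five labels: NB, NS, ZE, PS, PB (negative big, negative small, zero, positive small, positive big)]

Rule table (entries are the output adjustment):

                    Error
Change of error     NB   NS   ZE   PS   PB
NB                  ZE   PS   PB   PB   PB
NS                  NS   ZE   PS   PS   PB
ZE                  NB   NS   ZE   PS   PB
PS                  NB   NS   NS   ZE   PS
PB                  NB   NB   NB   NS   ZE

§ error = setting - position
§ change of error = error - previous error
§ output is an adjustment to current output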
Motorola, Inc Chapman - ualberta.ca
Fuzzy Rules
§ IF error is positive big THEN output is much smaller
§ IF error is small and change of error is positive big THEN output is
smaller
§ IF error is small and change of error is positive small THEN output
is a little smaller
§ IF error and change of error are zero THEN output is zero

Motorola, Inc Chapman - ualberta.ca


Fuzzy Controller: Implementation
§ 68HC12 microcontroller (Motorola)
– Fuzzy logic instructions
§ Simple fuzzy system response on 68HC11:
– 750 ms
§ Simple fuzzy system response on 68HC12:
– 50 us (15,000 times faster)
• An enabling technology

Motorola, Inc Chapman - ualberta.ca


Embedded Fuzzy
§ Single chip 68HC12 microcontroller
§ Native fuzzy instructions:
– MEM; evaluate membership functions
– REV; rule evaluation: IF a is x THEN b is y
– WAV; weighted averaging

$$\text{system\_output} = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i}$$

[Figure: trapezoidal membership function defined by point1, point2, slope1, and slope2. The degree of membership runs from 0 to $FF over an input range of 0 to $FF.]

§ Additional related instructions
– MINA (place smaller of two unsigned 8-bit values in accumulator A)
– EMIND (place smaller of two unsigned 16-bit values in accumulator D)
– MAXM (place larger of two unsigned 8-bit values in memory)
– EMAXM (place larger of two unsigned 16-bit values in memory)
– TBL (table lookup and interpolate)
– ETBL (extended table lookup and interpolate)
– EMACS (extended multiply and accumulate signed 16-bit by 16-bit to 32-bit)
§ Orders of magnitude faster than fuzzy routines
§ Includes: timers, PWM, A/D, Flash, RAM
§ Small and low power

Motorola, Inc Chapman - ualberta.ca


68HC12: Fuzzy Innards
[Figure: data flow through the fuzzy inference kernel. The knowledge base (program data) supplies the tables and the kernel is a short sequence of assembler instructions.]

§ System inputs
§ Fuzzification (mem), using the input membership functions (trapezoids
defined by point1, point2, slope1, slope2)
§ Fuzzy inputs (in RAM)
§ Rule evaluation (rev), using the rule list: IF a is x THEN b is y
§ Fuzzy outputs (in RAM)
§ Defuzzification (wav, ediv), using the output membership functions
§ System outputs

$$\text{system\_output} = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i}$$

Motorola, Inc Chapman - ualberta.ca
A Closer Look
§ Each fuzzy instruction uses an efficient memory structure to
maintain information.
– The MEM instruction uses an array of trapezoids for membership functions
and writes to an array of bytes with the calculated membership values for
each input.
– The REV instruction uses a byte array of offsets and flags for the rule
antecedents and consequents and byte arrays for the outputs.
– The REVW instruction uses an array of 16-bit pointers and flags for
antecedent and consequent values and a byte array for weights and
outputs.
– The WAV instruction uses a byte array for outputs and singletons.

Motorola, Inc Chapman - ualberta.ca


Trapezoidal Parameters

[Figure: membership functions stored as consecutive 4-byte entries, one trapezoid per entry]

point1 point2 slope1 slope2
point1 point2 slope1 slope2
point1 point2 slope1 slope2
point1 point2 slope1 slope2
point1 point2 slope1 slope2
point1 point2 slope1 slope2
point1 point2 slope1 slope2

Motorola, Inc Chapman - ualberta.ca
Fuzzify

fuzzify: ldx  #input_mfs    ; point at membership functions
         ldy  #fuz_ins      ; point at fuzzy input table
         ldaa current_ins   ; get first input values
         ldab #7            ; 7 labels per input
loop:    mem                ; evaluate one membership func.
         dbne b, loop       ; for 7 labels of one input

X Y
point1 point2 slope1 slope2 mv1
point1 point2 slope1 slope2 mv2
point1 point2 slope1 slope2 mv3
point1 point2 slope1 slope2 mv4
point1 point2 slope1 slope2 mv5
point1 point2 slope1 slope2 mv6
point1 point2 slope1 slope2 mv7

Motorola, Inc Chapman - ualberta.ca
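
For comparison with the single `mem` instruction, the trapezoid evaluation can be written out in C. This sketch follows my understanding of the computation (grade is the smaller of the two slope products, saturated at $FF, with a slope byte of $00 treated as a vertical edge). The CPU12 reference manual remains the authority on exact behavior. `trapezoid_t` and `membership` are hypothetical names.

```c
#include <stdint.h>

/* Trapezoidal membership function in the 4-byte layout shown above. */
typedef struct {
    uint8_t point1;  /* left edge of the trapezoid          */
    uint8_t point2;  /* right edge of the trapezoid         */
    uint8_t slope1;  /* rising slope ($00 = vertical edge)  */
    uint8_t slope2;  /* falling slope ($00 = vertical edge) */
} trapezoid_t;

/* Degree of membership of x, from $00 (none) to $FF (full). */
static uint8_t membership(uint8_t x, trapezoid_t t)
{
    if (x < t.point1 || x > t.point2) {
        return 0x00;                          /* outside the trapezoid */
    }
    /* Grade implied by each sloping side; a zero slope byte means vertical. */
    uint32_t rise = t.slope1 ? (uint32_t)(x - t.point1) * t.slope1 : 0xFFu;
    uint32_t fall = t.slope2 ? (uint32_t)(t.point2 - x) * t.slope2 : 0xFFu;
    uint32_t grade = rise < fall ? rise : fall;
    return (uint8_t)(grade > 0xFFu ? 0xFFu : grade);   /* saturate at $FF */
}
```

For a trapezoid with point1 = $40, point2 = $C0, and both slopes = 8, an input of $50 yields $80 and an input of $80 saturates at $FF. Fuzzifying one input means calling this once per label, which is what the 7-iteration `mem` loop does in hardware.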


Rule Evaluation
§ Rule evaluation is the central element of a fuzzy logic inference
program.
– Processes a list of rules from the knowledge base using current fuzzy input
values from RAM to produce a list of fuzzy outputs in RAM.
§ The CPU12 offers two variations of rule evaluation instructions:
– The REV instruction provides for unweighted rules (all rules are considered
to be equally important).
– The REVW instruction is similar but allows each rule to have a separate
weighting factor which is stored in a separate parallel data structure in the
knowledge base.
§ The fuzzy AND operator corresponds to the mathematical
minimum operation and the fuzzy OR operator corresponds to
the mathematical maximum operation.
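
Unweighted min-max rule evaluation can be expressed in C as follows. This is a sketch of the technique, not of the REV instruction's memory format: `rule_t` is a hypothetical structure that stores antecedent and consequent indexes directly, and each rule has a single consequent for brevity.

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_ANTECEDENTS = 4 };

/* One unweighted rule: IF in[a0] AND in[a1] ... THEN out[consequent]. */
typedef struct {
    uint8_t antecedents[MAX_ANTECEDENTS]; /* indexes into the fuzzy inputs */
    size_t  n_antecedents;
    uint8_t consequent;                   /* index into the fuzzy outputs  */
} rule_t;

/* Min-max inference: AND = minimum over antecedents, OR = maximum over rules.
 * The caller must clear fuzzy_out before the first call. */
static void evaluate_rules(const rule_t *rules, size_t n_rules,
                           const uint8_t *fuzzy_in, uint8_t *fuzzy_out)
{
    for (size_t r = 0; r < n_rules; r++) {
        uint8_t truth = 0xFF;             /* start at full truth */
        for (size_t a = 0; a < rules[r].n_antecedents; a++) {
            uint8_t v = fuzzy_in[rules[r].antecedents[a]];
            if (v < truth) truth = v;     /* fuzzy AND */
        }
        if (truth > fuzzy_out[rules[r].consequent]) {
            fuzzy_out[rules[r].consequent] = truth;   /* fuzzy OR */
        }
    }
}
```

Each rule's truth value is the weakest of its antecedents, and each fuzzy output ends up holding the strongest truth value among the rules that point at it.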

Motorola, Inc Chapman - ualberta.ca


Evaluate Rules

        ldab #7            ; loop count
eval:   clr  1,y+          ; clear a fuzzy out and inc pointer
        dbne b, eval       ; loop to clr all fuzzy outs
        ldx  #rule_start   ; point at first rule element
        ldy  #fuz_ins      ; point at fuzzy ins and outs
        ldaa #$ff          ; init A (and clears V-bit)
        rev                ; process rule list

[Figure: register and memory layout for rev. X points at the rule list, Y points at the fuzzy inputs (mv1 … mv7) followed by the fuzzy outputs (f1 … f7), and A holds the running truth value.]

Rule list in memory ($FE separates antecedents from consequents, $FF ends the list):

antecedent $FE consequent $FE
antecedent $FE consequent $FE
antecedent antecedent $FE consequent $FE
antecedent $FE consequent $FF

Motorola, Inc Chapman - ualberta.ca


De-Fuzzify

defuz:  ldy  #fuz_out      ; point at fuzzy outputs
        ldx  #sgltn_pos    ; point at singleton positions
        ldab #7            ; 7 fuzzy outs per COG output
        wav                ; calculate sums for wtd av
        ediv               ; final divide for wtd av
        tfr  y,d           ; move result to A:B
        stab cog_out       ; store system output

[Figure: Y points at the fuzzy outputs f1 … f7 and X points at the singleton positions s1 … s7]

$$\text{system\_output} = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i}$$

Motorola, Inc Chapman - ualberta.ca
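
The weighted average computed by `wav` followed by `ediv` corresponds to the C function below. It is a plain reference computation of the formula with S as the singleton positions and F as the fuzzy outputs. The name `defuzzify` and the choice to return 0 when all weights are zero are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Weighted average of singleton positions S[i], weighted by fuzzy outputs F[i].
 * Returns 0 when every weight is zero (an arbitrary choice for this sketch). */
static uint8_t defuzzify(const uint8_t *S, const uint8_t *F, size_t n)
{
    uint32_t num = 0;   /* sum of S[i] * F[i] */
    uint32_t den = 0;   /* sum of F[i]        */
    for (size_t i = 0; i < n; i++) {
        num += (uint32_t)S[i] * F[i];
        den += F[i];
    }
    return den ? (uint8_t)(num / den) : 0;
}
```

With singletons at $40 and $C0 and equal weights of $80, the result is $80, the midpoint. The quotient always fits in 8 bits because it is an average of 8-bit positions.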


Program Measurements
§ Assume: 2 inputs, 1 output, 25 non-conjunctive rules and 7
membership functions on ins and out.
– Data structure costs: 160 bytes
– Program structure costs: <100 bytes
– Execution time: 75 µs (response time), or about 13 kHz maximum cycle rate

Motorola, Inc Chapman - ualberta.ca


Example: Multiple Fuzzy Controllers for a Quadruped Robot

[Figure: controller hierarchy. A balance controller (inputs: position, tilt sensors, energy) sits above four limb controllers. Each limb controller drives two joint controllers, and each joint controller has its own sensor and actuator.]

§ hierarchy of 13 fuzzy controllers
§ standing level on uneven surfaces
§ maintaining balance dynamically

Motorola, Inc Chapman - ualberta.ca


Reconfigurable Architectures: Tensilica LX



Taxonomy of Reconfigurable Architectures

RECONFIGURABLE ARCHITECTURES (R-SOC)

§ FINE GRAIN (FPGA)
– Island Topology
• Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
– Hierarchical Topology
• Altera Stratix, Altera Apex, Altera Cyclone

§ MULTI GRANULARITY (Heterogeneous)
– Processor + Coprocessor: Coarse Grain Coprocessor
• Chameleon, REMARC, Morphosys
– Processor + Coprocessor: Fine Grain Coprocessor
• Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro,
Altera Excalibur, Atmel FPSLIC, Tensilica
– Tile-Based Architecture
• aSoC, E-FPFA

§ COARSE GRAIN (Systolic)
– Mesh Topology
• RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp
– Linear Topology
• Systolic Ring, RaPiD, PipeRench
– Hierarchical Topology
• DART, FPFA



Xtensa LX – Basic Architecture
§ Processor Configuration
– Energy Usage: 76 µW/MHz, 47 µW/MHz (5 and 7 stage pipeline)
– Clock Speed: 350 MHz, 400 MHz (5 and 7 stage pipeline)
– Cache:
• up to 32 KB and 1,2,3,4 way set associative cache
– 64 general purpose physical registers (32-bits)
– 6 special purpose registers
– Extensible via use of TIE and FLIX instructions
– Zero-overhead loops

Courtesy Tensilica
Xtensa LX Architecture
§ 32-bit ALU
§ 1 or 2 Load/Store Model
§ Registers
– 32-bit general purpose register file
– 32-bit program counter
– 16 optional 1-bit Boolean registers
– 16 optional 32-bit floating point registers
– 4 optional 32-bit MAC16 data registers
– Optional Vectra LX DSP registers
§ General Purpose AR Register File
– 32 or 64 registers
– Instructions have access through “sliding window” of 16 registers. Window
can rotate by 4, 8, or 12 registers
– Register window reduces code size by limiting the number of bits needed for
a register address and eliminates the need to save and restore register files
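
The sliding window can be pictured as an offset into the physical register file. The sketch below is a simplified model with hypothetical names (`ar_file_t`, `ar_reg`, `ar_rotate`). It assumes the 64-register configuration and ignores window overflow and underflow handling, which a real implementation needs when the window wraps onto live registers.

```c
#include <stdint.h>

enum { NUM_PHYS_REGS = 64, WINDOW_SIZE = 16 };

typedef struct {
    uint32_t phys[NUM_PHYS_REGS];
    uint32_t window_base;          /* position of a0 in the physical file */
} ar_file_t;

/* Map a 4-bit architectural register number (a0..a15) to a physical register. */
static uint32_t *ar_reg(ar_file_t *f, unsigned n)
{
    return &f->phys[(f->window_base + (n % WINDOW_SIZE)) % NUM_PHYS_REGS];
}

/* Rotate the window: +4, +8, or +12 on a call, the negative on return. */
static void ar_rotate(ar_file_t *f, int regs)
{
    f->window_base =
        (uint32_t)((int)f->window_base + regs + NUM_PHYS_REGS) % NUM_PHYS_REGS;
}
```

After a rotation by 8, the register the caller addressed as a8 is addressed by the callee as a0, so values pass between the two without being copied or spilled. Instructions only ever encode 4-bit register numbers, which is where the code size saving comes from.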

Courtesy Tensilica
Xtensa LX Architecture

Courtesy Tensilica
Xtensa LX Pipelining

§ 5 or 7 Stage Pipeline Design:
– 5 stage pipeline has stages: IF,
Register Access, Execute,
Data-Memory Access, and Register
Writeback
– 5 stage pipeline accesses memory
in two stages. The 7 stage pipeline is
an extended version of the 5 stage
pipeline with an extra IF stage and an
extra Memory Access stage. The extra
stages provide more time for memory
access, so the designer can run at a
higher clock speed while using slower
memory.

Courtesy Tensilica
Xtensa LX Instruction Set

§ The Xtensa ISA consists of 80
core instructions including both
16 and 24 bit instructions

Courtesy Tensilica
Xtensa LX ISA – Building Blocks
§ Floating Point Unit
– 32-bit, single precision, floating-point coprocessor
§ Vectra LX DSP Engine
– Optimized to handle Digital Signal Processing Applications

Courtesy Tensilica
Vectra LX DSP Engine
§ FLIX (Flexible length instruction extension) Based
§ Vectra LX instructions encoded in 64 bits.
– Bits 0:3 of an Xtensa instruction determine its length and format; a
value of 14 specifies a Vectra LX instruction
– Bits 4:27 contain either an Xtensa LX core instruction or a Vectra LX load
or store instruction
– Bits 28:45 contain either a MAC instruction or a select instruction
– Bits 46:63 contain either ALU and shift instructions or a load and store
instruction for the second Vectra LX load/store unit
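
The bit ranges above translate directly into a field extractor. The sketch assumes that bit 0 is the least significant bit of the 64-bit word, which the slide does not state. `bits`, `vectra_slots_t`, and `vectra_decode` are hypothetical names, and the slot contents are returned as raw bit patterns without further decoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive, bit 0 = least significant) of a 64-bit word. */
static uint64_t bits(uint64_t word, unsigned lo, unsigned hi)
{
    unsigned width = hi - lo + 1u;
    uint64_t mask = (width >= 64u) ? ~0ULL : ((1ULL << width) - 1ULL);
    return (word >> lo) & mask;
}

typedef struct {
    uint32_t slot0;  /* bits 4:27  : core or Vectra LX load/store op   */
    uint32_t slot1;  /* bits 28:45 : MAC or select op                  */
    uint32_t slot2;  /* bits 46:63 : ALU/shift or second load/store op */
} vectra_slots_t;

/* Returns false if bits 0:3 do not carry the Vectra LX format value 14. */
static bool vectra_decode(uint64_t insn, vectra_slots_t *out)
{
    if (bits(insn, 0, 3) != 14u) {
        return false;
    }
    out->slot0 = (uint32_t)bits(insn, 4, 27);
    out->slot1 = (uint32_t)bits(insn, 28, 45);
    out->slot2 = (uint32_t)bits(insn, 46, 63);
    return true;
}
```

The three slots are 24, 18, and 18 bits wide, which together with the 4-bit format field account for all 64 bits. Issuing up to three operations from one instruction word is what makes the format resemble VLIW.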

Courtesy Tensilica
TIE (Tensilica Instruction Extension)
§ Method used to extend the processor’s architecture and
instruction set using the TIE compiler

§ TIE Compiler
– Generates file used to configure software development tools so that they
recognize TIE Extensions
– Estimates hardware size of new instruction
– You can modify application code to take advantage of the new instruction
and simulate to decide if the speed advantage is worth the hardware cost

§ TIE Syntax
– Resembles Verilog
– More concise than RTL (it omits all sequential logic, pipeline registers, and
initialization sequences).
– The custom instructions and registers described in TIE are part of the
processor’s programming model.
Courtesy Tensilica
TIE Queues and Ports
§ New way to communicate with external devices
§ Queues: data can be sent or read through queues. A queue is
defined in the TIE and the compiler generates the interface
signals required for the additional port needed to connect to the
queue. Logic is also automatically generated
§ Import-wire: processor can sample the value of an external signal
§ Export-state: drive an output based on TIE

Courtesy Tensilica
TIE
§ TIE combines multiple operations into one using:
– Fusion
– SIMD/Vector Transformation
– FLIX

Courtesy Tensilica
Fusion
§ Allows you to combine dependent operations into a single
instruction

Consider: computing the average of two arrays

unsigned short *a, *b, *c;
. . .
for (i = 0; i < n; i++)
    c[i] = (a[i] + b[i]) >> 1;

Two Xtensa LX Core instructions required, in addition to load/store
instructions

Courtesy Tensilica
Fusion
Fuse the two operations into a single TIE instruction

operation AVERAGE {out AR res, in AR input0, in AR input1} {} {
    wire [16:0] tmp = input0[15:0] + input1[15:0];
    assign res = tmp[16:1];
}

Essentially an add feeding a shift, described using standard
Verilog-like syntax

Implementing the instruction in C/C++

#include <xtensa/tie/average.h>
unsigned short *a, *b, *c;
. . .
for (i = 0; i < n; i++)
    c[i] = AVERAGE(a[i], b[i]);

Courtesy Tensilica
SIMD/Vector Transformation
§ Single Instruction, Multiple Data
– Fusing instructions into a “vector”
– Allows replication of the same operation multiple times in one instruction

Consider: Computing four averages in one instruction

The following TIE code computes multiple iterations in a single instruction by
combining Fusion and SIMD

regfile VEC 64 8 v

operation VAVERAGE {out VEC res, in VEC input0, in VEC input1} {} {
    wire [67:0] tmp = {input0[63:48] + input1[63:48],
                       input0[47:32] + input1[47:32],
                       input0[31:16] + input1[31:16],
                       input0[15:0] + input1[15:0] };
    assign res = {tmp[67:52], tmp[50:35], tmp[33:18], tmp[16:1]};
}

Courtesy Tensilica
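
A portable C reference model is useful for checking what the SIMD instruction is meant to compute, for example when comparing simulator output against expected values. The function below is such a model under stated assumptions: `vaverage_ref` is a hypothetical name, the 64-bit vector is held in a `uint64_t`, and lane 0 occupies bits 15:0.

```c
#include <stdint.h>

/* Reference model: four independent 16-bit averages packed in a 64-bit vector. */
static uint64_t vaverage_ref(uint64_t in0, uint64_t in1)
{
    uint64_t res = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 16u * lane;
        uint32_t a = (uint32_t)((in0 >> shift) & 0xFFFFu);
        uint32_t b = (uint32_t)((in1 >> shift) & 0xFFFFu);
        uint64_t avg = (a + b) >> 1;      /* 17-bit sum, keep bits 16:1 */
        res |= avg << shift;
    }
    return res;
}
```

Each lane keeps its own carry bit, so a sum that overflows 16 bits in one lane does not disturb its neighbor. That is the purpose of the 17-bit slices selected by `tmp[16:1]`, `tmp[33:18]`, and so on in the TIE description.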
SIMD/Vector Transformation
§ Computing four 16-bit averages
– Each data vector must be 64 bits (4 x 16 bits)
§ Create new register file, new instruction
– VEC - eight 64-bit registers to hold data vectors
– VAVERAGE - takes operands from VEC, computes average, saves results into
VEC

VEC *a, *b, *c;
/* assumes n counts 16-bit elements; each VEC element holds four of them */
for (i = 0; i < n / 4; i++) {
    c[i] = VAVERAGE(a[i], b[i]);
}

§ New data type recognized
– TIE automatically creates new load, store instructions to move 64-bit vectors
between VEC register file and memory

Courtesy Tensilica
FLIX
§ Flexible length instruction extension
– Key in extreme extensibility
– Huge performance gains possible
– Code size reduction without code bloat
§ Similar to VLIW
§ Created by XPRES Compiler

Courtesy Tensilica
FLIX - Usage
§ Used selectively when parallelism is needed
§ Avoids code bloat
§ Used seamlessly with standard 16- and 24-bit instructions

Courtesy Tensilica
XPRES Compiler
§ Powerful synthesis tool
– Creates tailored processor descriptions
– Runs on native C/C++ code
§ Three optimization methods
§ Returns optimal configurations along with pros and cons
(tradeoffs)
§ Analyzes C/C++ code
§ Generates possible configurations
§ Compares performance criteria to silicon size (cost)
§ Returns possible configurations

Courtesy Tensilica