EE382N-4 Advanced Microcontroller Systems: Accelerators and Co-Processors
Mark McDermott
Spring 2018
§ OS resource management
– Services (API) for allocation, partial reconfiguration, saving and restoring the status, and monitoring
– Multiprogramming scheduler can pre-fetch hardware accelerators in time for next use
– Control the access to the new hardware to ensure trust under private or shared use
[Figure: applications on a CPU using FPGA accelerators and devices. Labels: Easy, Application, CPU, FPGA, accelerators, devices, Soft object, Hard object, User level function or device driver]
Perkowski, psu.edu
[Figure: PS/PL block diagram with numbered paths 1 to 6. Labels: ARM Core, I/C, AXI Slave, AHB, DDR Controller, DDR, PS, PL, Block RAM, Accelerator]
§ AXI
– 32 bit Bus
– Access to DRAM data & programmable logic fabric
– 1/2 CPU frequency
– Big penalty if bus is busy during first attempt to access bus
§ AHB (AMBA Advanced High-performance Bus)
– 64 bit bus
– Runs at CPU clock frequency
– Access to DDR Controller to provide addresses to SDRAM

Path             First Access     Pipelined Bus Access   Arbitration
                 Read   Write     Read   Write
1 ARM → I/C        2      2         2      2
2 I/C → AXI        8      8         3      3              5
3 AHB → DDRC       4      4         4      4
4 DDRC → DRAM      8      9         3      3              5
5 AXI ↔ BRAM      20     20         8      8             12
6 BRAM ↔ ACC       2      2         2      2
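The latency table above can be turned into a quick estimate of what one access costs end to end. The sketch below assumes that the hops on a route are traversed one after another, so their latencies simply add; the slide does not state this, nor the units. The two routes are hypothetical choices for illustration.

```python
# Per-hop latencies copied from the table above:
# (first read, first write, pipelined read, pipelined write)
HOPS = {
    "ARM->I/C":   (2, 2, 2, 2),
    "I/C->AXI":   (8, 8, 3, 3),
    "AHB->DDRC":  (4, 4, 4, 4),
    "DDRC->DRAM": (8, 9, 3, 3),
    "AXI<->BRAM": (20, 20, 8, 8),
    "BRAM<->ACC": (2, 2, 2, 2),
}
COLUMN = {"first_read": 0, "first_write": 1,
          "pipelined_read": 2, "pipelined_write": 3}


def path_latency(path, kind):
    """Sum one table column over the hops in `path` (assumes serial hops)."""
    col = COLUMN[kind]
    return sum(HOPS[hop][col] for hop in path)


# Hypothetical routes, not taken from the slide.
CPU_TO_BRAM = ["ARM->I/C", "I/C->AXI", "AXI<->BRAM"]
CPU_TO_DRAM = ["ARM->I/C", "AHB->DDRC", "DDRC->DRAM"]

if __name__ == "__main__":
    for name, path in (("BRAM", CPU_TO_BRAM), ("DRAM", CPU_TO_DRAM)):
        print(name,
              path_latency(path, "first_read"),
              path_latency(path, "pipelined_read"))
```

Under these assumptions a first read from BRAM costs more than twice a first read from DRAM, which is the penalty the AXI bullet warns about.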
§ Interrupts may not work for real-time control tasks with tight
schedules
[Figure: timeline (Time →) of one accelerator call across Application, Operating System, and Hardware]

Application                          Operating System                            Hardware
open(/dev/accel); /* only once */    Enable Accelerator Access for Application   AXI
…
/* construct macroblocks */
macroblock = …
syscall(&macroblock, num_blocks)     Data copy                                   ARM, Memory
…
                                     Flush Cache Range                           AXI, ARM, Memory
                                     Setup DMA Transfer                          AXI, ARM, DMA Controller

Overhead is a major issue!
2/22/18 EE382N-4 Class Notes 21
Device Driver Access Cost
[Figure: TMS320C55x block diagram. Labels: Memory System, Instruction decode, Register file, TCC I/F, TCC, TCC instructions]
https://www.xilinx.com/support/documentation/application_notes/xapp1170-zynq-hls.pdf
https://www.xilinx.com/support/documentation/sw_manuals/ug1046-ultrafast-design-methodology-guide.pdf
[Figure: TPU block diagram. Labels: System Configuration, Development Support and Test, CLK, time base, Channel Control, Microengine, Control Store, Execution Unit, Control and Data, Data Store, Parameter RAM, Pins, Channel 0 through Channel 15]
§ Sixteen channels
– each one connects to an MCU pin
§ Each channel has symmetric hardware:
§ Event register
– 16-bit capture register
– 16-bit compare/match register
– 16-bit comparator
§ Pin control logic: pin direction determined by TPU microengine
§ Microengine executes microcoded functions for the selected channel.
§ Microengine returns control to the scheduler when completed.
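Because the capture and compare registers are 16 bits wide, any arithmetic on them has to tolerate counter rollover. The sketch below is a simplified software model of that arithmetic, not the TPU microcode: it assumes a free-running 16-bit counter and an equality comparator, and all function names are hypothetical.

```python
MASK16 = 0xFFFF  # capture and compare registers are 16 bits wide


def elapsed_ticks(earlier_capture, later_capture):
    """Ticks between two captures of a free-running 16-bit counter.

    Modular subtraction stays correct across a single counter rollover.
    """
    return (later_capture - earlier_capture) & MASK16


def next_match(current_count, delay_ticks):
    """Compare-register value that matches `delay_ticks` after `current_count`."""
    return (current_count + delay_ticks) & MASK16


def matched(counter, compare_value):
    """Simplified comparator: true when the counter equals the match register."""
    return (counter & MASK16) == (compare_value & MASK16)


if __name__ == "__main__":
    # Two edges captured either side of a rollover.
    print(hex(elapsed_ticks(0xFFF0, 0x0010)))
    print(hex(next_match(0xFFF0, 0x20)))
```

The model only handles intervals shorter than one full counter period; longer intervals would need a rollover count kept elsewhere.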
Chapman - ualberta.ca
Fuzzy Sets
§ In the real world, very few things belong to a single classification
and often the boundaries are not clear
§ Fuzzy sets, then, are an extension of regular (Aristotelian) sets:
– sets can overlap
– members of a set have a degree of membership instead of simply belonging
or not belonging
§ Fuzzy sets can be given linguistic labels such as warm, positive
big, very cold, etc.
§ Fuzzy logic allows the sets to be computed
– boundary conditions same as two-level logic
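The boundary-condition point can be checked directly. The sketch below assumes the common min/max/complement operators for fuzzy AND, OR, and NOT; the slide does not say which operators it has in mind, so treat that choice as an assumption.

```python
def fuzzy_and(a, b):
    """Fuzzy AND as the minimum of two membership degrees in [0.0, 1.0]."""
    return min(a, b)


def fuzzy_or(a, b):
    """Fuzzy OR as the maximum of two membership degrees."""
    return max(a, b)


def fuzzy_not(a):
    """Fuzzy NOT as the complement of a membership degree."""
    return 1.0 - a


if __name__ == "__main__":
    # At the boundaries 0.0 and 1.0 the operators reproduce two-level logic.
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            assert fuzzy_and(a, b) == float(bool(a) and bool(b))
            assert fuzzy_or(a, b) == float(bool(a) or bool(b))
        assert fuzzy_not(a) == float(not bool(a))
    # In between, the results are graded.
    print(fuzzy_and(0.25, 0.75), fuzzy_or(0.25, 0.75), fuzzy_not(0.25))
```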
Input Memberships
§ The degree to which an input belongs to a classification is called
its membership value
§ The classifications can overlap so that an input value can belong
to more than one, hence the ability to fuzzify the input
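[Figure: input membership functions. Vertical axis: degree of membership, 0.0 ($00) to 1.0 ($FF). Horizontal axis: range of values, $00 $40 $80 $C0 $FF. Overlapping sets shown: negative small, zero, positive small]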
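A small sketch of fuzzification with overlapping sets, using 8-bit membership grades ($00 to $FF) as in the figure. The set labels come from the figure, but the triangular shape, the peak positions, and the slope are assumptions made for illustration.

```python
# Hypothetical peak positions for the three labels shown in the figure.
SETS = {"negative small": 0x40, "zero": 0x80, "positive small": 0xC0}


def grade(x, peak, slope=4):
    """Triangular membership: $FF at the peak, falling by `slope` per step."""
    return max(0, 0xFF - abs(x - peak) * slope)


def fuzzify(x):
    """Return the membership grade of input `x` in every set."""
    return {label: grade(x, peak) for label, peak in SETS.items()}


if __name__ == "__main__":
    # An input halfway between two peaks belongs partly to both sets.
    print(fuzzify(0x60))
    print(fuzzify(0x80))
```

An input of $60 lands in both "negative small" and "zero" with equal grades, which is the overlap the bullets describe.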
Inputs: Error (columns) and Change of Error (rows), each with labels NB NS ZE PS PB
Output: Output Adjustment, with labels NB NS ZE PS PB

Change of Error    Error:  NB   NS   ZE   PS   PB
NB                         ZE   PS   PB   PB   PB
NS                         NS   ZE   PS   PS   PB
ZE                         NB   NS   ZE   PS   PB
PS                         NB   NS   NS   ZE   PS
PB                         NB   NB   NB   NS   ZE
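The rule table above can be evaluated with the usual min-max scheme: the strength of a rule is the minimum of its two input grades, and each output label keeps the maximum strength of any rule that points to it. The sketch assumes rows are Change of Error and columns are Error, which is one reading of the flattened slide, and it uses 8-bit grades.

```python
LABELS = ["NB", "NS", "ZE", "PS", "PB"]

# RULES[change_of_error] lists the output label for each Error label in LABELS.
RULES = {
    "NB": ["ZE", "PS", "PB", "PB", "PB"],
    "NS": ["NS", "ZE", "PS", "PS", "PB"],
    "ZE": ["NB", "NS", "ZE", "PS", "PB"],
    "PS": ["NB", "NS", "NS", "ZE", "PS"],
    "PB": ["NB", "NB", "NB", "NS", "ZE"],
}


def evaluate_rules(error, change):
    """Min-max rule evaluation.

    `error` and `change` map labels to 8-bit grades; missing labels count as 0.
    Returns the fuzzy output grade for every output label.
    """
    outputs = {label: 0 for label in LABELS}
    for change_label, row in RULES.items():
        for error_label, out_label in zip(LABELS, row):
            strength = min(error.get(error_label, 0), change.get(change_label, 0))
            outputs[out_label] = max(outputs[out_label], strength)
    return outputs


if __name__ == "__main__":
    error = {"ZE": 0xC0, "PS": 0x40}
    change = {"NS": 0x80, "ZE": 0x80}
    print(evaluate_rules(error, change))
```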
– REV; rule evaluation: IF a is x THEN b is y
[Figure: trapezoidal membership function over the input range 0 to $FF, defined by point1, point2, slope1, and slope2 (mem)]
[Figure: data flow from the fuzzy inputs (in RAM) through the output membership functions to defuzzification (wav, ediv), which produces the system outputs]
\[ system\_output = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i} \]
Motorola, Inc.
A Closer Look
§ Each fuzzy instruction uses an efficient memory structure to
maintain information.
– The MEM instruction uses an array of trapezoids for membership functions
and writes to an array of bytes with the calculated membership values for
each input.
– The REV instruction uses a byte array of offsets and flags for the rule
antecedents and consequents and byte arrays for the outputs.
– The REVW instruction uses an array of 16-bit pointers and flags for
antecedent and consequent values and a byte array for weights and
outputs.
– The WAV instruction uses byte arrays for outputs and singletons.
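A software model of the trapezoid evaluation that MEM performs may help make the data layout concrete. The sketch assumes each trapezoid is stored as (point1, point2, slope1, slope2), that grades saturate at $FF, and that a slope of 0 stands for a vertical edge. These conventions follow my understanding of the instruction and should be checked against the CPU reference manual; the function names are hypothetical.

```python
def trapezoid_grade(x, point1, point2, slope1, slope2):
    """Membership grade (0..$FF) of input `x` in one trapezoid.

    Assumption: a slope of 0 marks a vertical edge (grade jumps straight to $FF).
    """
    if x < point1 or x > point2:
        return 0
    rising = 0xFF if slope1 == 0 else min(0xFF, (x - point1) * slope1)
    falling = 0xFF if slope2 == 0 else min(0xFF, (point2 - x) * slope2)
    return min(rising, falling)


def mem(x, trapezoids):
    """Evaluate `x` against an array of trapezoids; returns one grade per entry."""
    return [trapezoid_grade(x, *t) for t in trapezoids]


if __name__ == "__main__":
    # Three hypothetical, overlapping membership functions.
    trapezoids = [
        (0x00, 0x60, 0, 8),
        (0x40, 0xC0, 8, 8),
        (0xA0, 0xFF, 8, 0),
    ]
    print(mem(0x50, trapezoids))
    print(mem(0x80, trapezoids))
```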
[Figure: memory structures used by the fuzzy instructions]
MEM: X points to an array of trapezoids (point1, point2, slope1, slope2), one row per membership function; Y points to the byte array of membership values mv1 … mv7; A holds the input value.
REV: X points to the rule list, in which antecedents and consequences are separated by $FE and the list ends with $FF; Y points to the ins (mv1 … mv7) and outs (f1 … f7).
WAV: Y points to the outs f1 … f7; X points to the singletons s1 … s7.
\[ system\_output = \frac{\sum_{i=1}^{n} S_i F_i}{\sum_{i=1}^{n} F_i} \]
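The weighted-average formula above is short enough to check by hand. The sketch below computes it with integer arithmetic over two byte arrays, singletons S and fuzzy outputs F. Returning None when every fuzzy output is zero is a choice made here for illustration, not behavior taken from the slides.

```python
def weighted_average(singletons, fuzzy_outputs):
    """Defuzzify: sum(S_i * F_i) / sum(F_i), using integer division.

    Returns None when all fuzzy outputs are zero (no rule fired).
    """
    if len(singletons) != len(fuzzy_outputs):
        raise ValueError("singletons and fuzzy outputs must have equal length")
    numerator = sum(s * f for s, f in zip(singletons, fuzzy_outputs))
    denominator = sum(fuzzy_outputs)
    if denominator == 0:
        return None
    return numerator // denominator


if __name__ == "__main__":
    # Hypothetical singleton positions for NB, NS, ZE, PS, PB.
    singletons = [0x00, 0x40, 0x80, 0xC0, 0xFF]
    fuzzy_outputs = [0, 0, 0x80, 0x80, 0]
    print(hex(weighted_average(singletons, fuzzy_outputs)))
```

With ZE and PS equally active, the result lands midway between their singletons.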
[Figure: Balance Controller fed by sensors for position, tilt, and energy]
RECONFIGURABLE ARCHITECTURES
(R-SOC)
Processor + Coprocessor    Tile-Based Architecture
Xilinx Virtex Altera Stratix Chameleon Pleiades aSoC RAW Systolic Ring DART
Xilinx Spartan Altera Apex REMARC Garp E-FPFA CHESS RaPiD FPFA
Atmel AT40K Altera Cyclone Morphosys FIPSOC MATRIX PipeRench
Lattice ispXPGA Triscend E5 KressArray
Triscend A7 Systolix Pulsedsp
Xilinx Virtex-II Pro
Altera Excalibur
Atmel FPSIC
Tensilica
Courtesy Tensilica
Xtensa LX Architecture
§ 32-bit ALU
§ 1 or 2 Load/Store Model
§ Registers
– 32-bit general purpose register file
– 32-bit program counter
– 16 optional 1-bit Boolean registers
– 16 optional 32-bit floating point registers
– 4 optional 32-bit MAC16 data registers
– Optional Vectra LX DSP registers
§ General Purpose AR Register File
– 32 or 64 registers
– Instructions access the file through a “sliding window” of 16 registers. The
window can rotate by 4, 8, or 12 registers
– The register window reduces code size by limiting the number of register-address
bits and eliminates the need to save and restore registers at procedure entry and exit
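A toy model of the sliding window shows how a rotation lets caller and callee share registers without copying. It assumes a circular physical file of 32 or 64 registers and omits window overflow and underflow handling, which a real core needs once the file wraps onto live registers. The class and method names are hypothetical.

```python
class WindowedRegisterFile:
    """Simplified sliding register window over a circular physical file."""

    WINDOW_SIZE = 16

    def __init__(self, num_physical=64):
        if num_physical not in (32, 64):
            raise ValueError("the slide lists 32 or 64 registers")
        self.regs = [0] * num_physical
        self.base = 0  # physical index of the current window's register 0

    def _index(self, n):
        if not 0 <= n < self.WINDOW_SIZE:
            raise IndexError("only 16 registers are visible at a time")
        return (self.base + n) % len(self.regs)

    def read(self, n):
        return self.regs[self._index(n)]

    def write(self, n, value):
        self.regs[self._index(n)] = value

    def rotate(self, amount):
        """Slide the window forward on a call."""
        if amount not in (4, 8, 12):
            raise ValueError("window rotates by 4, 8, or 12 registers")
        self.base = (self.base + amount) % len(self.regs)

    def unrotate(self, amount):
        """Slide the window back on a return."""
        if amount not in (4, 8, 12):
            raise ValueError("window rotates by 4, 8, or 12 registers")
        self.base = (self.base - amount) % len(self.regs)


if __name__ == "__main__":
    rf = WindowedRegisterFile(64)
    rf.write(3, 7)    # caller-local value, hidden from the callee
    rf.write(10, 42)  # outgoing argument
    rf.rotate(8)      # call: the callee's register 2 is the caller's register 10
    print(rf.read(2))
    rf.write(2, 99)   # return value
    rf.unrotate(8)    # return
    print(rf.read(10), rf.read(3))
```

Only four bits are needed to name a register inside the window, even though the physical file is larger, which is where the code-size saving comes from.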
Xtensa LX Architecture
Xtensa LX Pipelining
Xtensa LX Instruction Set
Xtensa LX ISA – Building Blocks
§ Floating Point Unit
– 32-bit, single precision, floating-point coprocessor
§ Vectra LX DSP Engine
– Optimized to handle Digital Signal Processing Applications
Vectra LX DSP Engine
§ FLIX (Flexible length instruction extension) Based
§ Vectra LX instructions encoded in 64 bits.
– Bits 0:3 of an Xtensa instruction determine its length and format; a value
of 14 specifies a Vectra LX instruction
– Bits 4:27 contain either an Xtensa LX core instruction or a Vectra LX load or
store instruction
– Bits 28:45 contain either a MAC instruction or a select instruction
– Bits 46:63 contain either ALU and shift instructions or a load and store
instruction for the second Vectra LX load/store unit
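The bit ranges above translate directly into a field extractor. The sketch assumes bit 0 is the least significant bit of the 64-bit word, which the slide does not state, and the field names are hypothetical labels rather than Tensilica terminology.

```python
# (low bit, high bit), both inclusive, taken from the bit ranges above.
FIELDS = {
    "format": (0, 3),
    "slot0": (4, 27),   # core instruction or Vectra LX load/store
    "slot1": (28, 45),  # MAC or select
    "slot2": (46, 63),  # ALU/shift or second load/store unit
}
VECTRA_FORMAT = 14


def field(word, low, high):
    """Extract bits low..high (inclusive) from `word`."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def decode(word):
    """Split a 64-bit instruction word into its four fields."""
    if not 0 <= word < (1 << 64):
        raise ValueError("expected a 64-bit instruction word")
    return {name: field(word, low, high) for name, (low, high) in FIELDS.items()}


def is_vectra(word):
    """True when the format bits carry the value 14."""
    return field(word, 0, 3) == VECTRA_FORMAT


if __name__ == "__main__":
    word = 14 | (0xABCDEF << 4) | (0x2AAAA << 28) | (0x15555 << 46)
    print({name: hex(value) for name, value in decode(word).items()})
    print(is_vectra(word))
```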
TIE (Tensilica Instruction Extension)
§ Method used to extend the processor’s architecture and
instruction set using the TIE compiler
§ TIE Compiler
– Generates file used to configure software development tools so that they
recognize TIE Extensions
– Estimates hardware size of new instruction
– You can modify application code to take advantage of the new instruction
and simulate to decide if the speed advantage is worth the hardware cost
§ TIE Syntax
– Resembles Verilog
– More concise than RTL (it omits all sequential logic, pipeline registers, and
initialization sequences)
– The custom instructions and registers described in TIE are part of the
processor’s programming model.
TIE Queues and Ports
§ New way to communicate with external devices
§ Queues: data can be sent or read through queues. A queue is
defined in TIE, and the TIE compiler generates the interface
signals required for the additional port needed to connect to the
queue. The supporting logic is also generated automatically.
§ Import-wire: the processor can sample the value of an external signal
§ Export-state: drives an output from a TIE state
TIE
§ TIE Combines multiple operations into one using:
– Fusion
– SIMD/Vector Transformation
– FLIX
Fusion
§ Allows you to combine dependent operations into a single
instruction
Fusion
Fuse the two operations into a single TIE instruction
SIMD/Vector Transformation
§ Single Instruction, Multiple Data
– Fusing instructions into a “vector”
– Allows replication of the same operation multiple times in one instruction
regfile VEC 64 8 v
SIMD/Vector Transformation
§ Computing four 16-bit averages
– Each data vector must be 64 bits (4 x 16 bits)
§ Create new register file, new instruction
– VEC - eight 64-bit registers to hold data vectors
– VAVERAGE - takes operands from VEC, computes average, saves results into
VEC
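What a VAVERAGE-style instruction computes can be modeled in software on a packed 64-bit value. This is a behavioral sketch only, not TIE code: it assumes unsigned 16-bit lanes, lane 0 in the low bits, and an average that truncates toward zero, none of which the slide specifies.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 16-bit values into one 64-bit vector (lane 0 in the low bits)."""
    if len(values) != LANES:
        raise ValueError("expected exactly four lane values")
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit vector back into its four 16-bit lanes."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def vaverage(a, b):
    """Lane-wise average of two packed vectors: add, then shift right by one.

    Each sum is formed at full precision, so a lane cannot overflow into
    its neighbor.
    """
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        x = (a >> shift) & LANE_MASK
        y = (b >> shift) & LANE_MASK
        result |= (((x + y) >> 1) & LANE_MASK) << shift
    return result


if __name__ == "__main__":
    a = pack([10, 20, 0xFFFF, 0])
    b = pack([20, 40, 0xFFFF, 1])
    print(unpack(vaverage(a, b)))
```

The add followed by a shift inside each lane is a fused operation, and repeating it across four lanes in one call is the SIMD step.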
FLIX
§ Flexible length instruction extension
– Key in extreme extensibility
– Huge performance gains possible
– Code size reduction without code bloat
§ Similar to VLIW
§ Created by XPRES Compiler
FLIX - Usage
§ Used selectively when parallelism is needed
§ Avoids code bloat
§ Used seamlessly with standard 16- and 24-bit instructions
XPRES Compiler
§ Powerful synthesis tool
– Creates tailored processor descriptions
– Run on native C/C++ code
§ Three optimization methods
§ Returns optimal configurations along with pros and cons (tradeoffs)
§ Analyzes C/C++ code
§ Generates possible configurations
§ Compares performance criteria to silicon size (cost)
§ Returns possible configurations