COMP ENG 4TL4:
Digital Signal Processing
Notes for Lectures #31 & #32
Tuesday, November 25 &
Wednesday, November 26, 2003
8. Introduction to DSP
Architectures
4TL4 DSP
Jeff Bondy and Ian Bruce
DSP Applications
• High volume embedded systems
    • Cell phones
    • Hard Drives
    • CD Drives
    • Modems
    • Printers
• High performance data processing
    • Sonar
    • Wireless Basestations
    • Video/Data Transport
Resources
• www.bdti.com (Started kernel speed benchmarking)
• www.eembc.org (Benchmarks for almost any application)
• http://www.techonline.com/community/tech_group/dsp
• (Motorola) http://ewww.motorola.com/webapp/sps/site/homepage.jsp?nodeId=06M10NcX0Fz
• (TI) http://dspvillage.ti.com/
• (Analog Devices) http://www.analog.com/Analog_Root/static/technology/dsp/beginnersGuide/index.html/
In ONE Cycle
• Fetch instruction (FETCH)
• Decode instruction (DECODE)
• Calculate address
• Fetch data (READ)
    • L2 hopefully, or else increase latency by going off chip, update L2 state
    • L2 -> L1, update L2 and L1 state
    • L1 -> Registers
    • Registers -> ALU
• Compute instruction (EXECUTE)
• Write result
• Update data pointers
• Update instruction pointer
Intro to DSP Architecture
• What and Why of MACs
• Multiple Memory Accesses
• Fast Address Generation Units
• Fast Looping
• Specialized Instruction Sets
• Lots of I/O
Typical DSP Heart
[Figure: block diagram of a typical DSP core. Callouts: Data Buses; Abundant Instant Memory Access; Huge ALU Dynamic Range; FAST ALU; Chained Shifter for repetitive calculations; Barrel Shifter.]
MACs: Multiply Accumulates
• In one clock cycle the ALU of a DSP can do a multiply and an addition.
• Used in:
    • Vector dot products
    • Correlation
    • Filters
    • Fourier Transforms
In addition to ALU changes, the bus structure must also change.
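To make the MAC idea concrete, here is a minimal Python sketch of a dot product built from a chain of multiply-accumulate steps. The function names (`mac`, `dot`, `fir_output`) are illustrative assumptions, not part of any DSP toolchain. On a DSP each loop iteration would ideally be a single-cycle MAC instruction, while this model only shows the arithmetic.

```python
def mac(acc, x, y):
    """One multiply-accumulate step: returns acc + x * y."""
    return acc + x * y


def dot(xs, ys):
    """Vector dot product as a chain of MACs (one MAC per element pair)."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


def fir_output(coeffs, recent_samples):
    """One FIR output sample: coefficients dotted with the most recent
    input samples (newest sample first)."""
    return dot(coeffs, recent_samples)


print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Correlation and FIR filtering use the same inner loop, which is why a single-cycle MAC pays off across all the applications listed above.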
Multiple Memory Accesses
• Complete MANY memory accesses in a single clock cycle
    • Processor can fetch instructions while also fetching the operands or storing to memory
    • During an FIR filter, the processor can perform a multiply-accumulate while loading the operand and coefficient for the next cycle
• Three reads and one or two writes per cycle
• This requires multiple memory buses on the same chip, not simply one address bus and one data bus
Dedicated Address Generation
• One or more address generation units, so the processor doesn't tie up the ALU/main data path
    • Register indirect addressing with post-increment
    • Modulo addressing
    • Bit reversed addressing
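The addressing modes above can be modelled in a few lines. This is a sketch under stated assumptions: `bit_reverse`, `bit_reversed_order` and `post_increment_modulo` are hypothetical helper names, and a real address generation unit does this in hardware alongside the ALU rather than in software.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Access order used for radix-2 FFT data reordering, n a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


def post_increment_modulo(addr, step, base, length):
    """Next pointer value after a post-increment that wraps inside the
    buffer [base, base + length)."""
    return base + (addr - base + step) % length


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

With `base=4` and `length=4`, stepping the pointer from address 7 by 1 wraps back to address 4, which is the behaviour a circular buffer needs.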
Efficient looping
• For repetitive or branching calculations. For-next loops in a general purpose algorithm kill performance with calculating conditions, checking loop logic and setting JUMPs.
    • <loop> and <repeat> instructions allow jumping to the top of the loop while incrementing and testing loop logic in a SINGLE cycle.
    • Delayed branching
    • Low to mid range DSPs have 3 to 5 stage pipelines to get rid of NOPs
Pipelining
None (Motorola 560xx, i.e. OLD)
    Fetch   Decode  Read    Execute
                                    Fetch   Decode  Read    Execute
Pipelined (Most conventional DSP processors)
    Fetch   Decode  Read    Execute
            Fetch   Decode  Read    Execute
                    Fetch   Decode  Read    Execute
Superscalar (Pentium, MIPS)
    Fetch   Decode  Read    Execute
    Fetch   Decode  Read    Execute
            Fetch   Decode  Read    Execute
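A rough way to compare the three schemes is to count cycles for a run of instructions. The sketch below assumes an idealized four-stage machine (Fetch, Decode, Read, Execute) with one cycle per stage and no stalls or branches. The function names and the `width` parameter are assumptions made for illustration.

```python
STAGES = 4  # Fetch, Decode, Read, Execute


def cycles_unpipelined(n_instructions, stages=STAGES):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * stages


def cycles_pipelined(n_instructions, stages=STAGES):
    """One instruction enters the pipeline per cycle once it is full."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def cycles_superscalar(n_instructions, width, stages=STAGES):
    """Up to `width` instructions enter the pipeline together each cycle."""
    if n_instructions == 0:
        return 0
    groups = -(-n_instructions // width)  # ceiling division
    return stages + (groups - 1)


for n in (1, 3, 100):
    print(n, cycles_unpipelined(n), cycles_pipelined(n), cycles_superscalar(n, 2))
```

For long instruction streams the pipelined count approaches one cycle per instruction, and the idealized dual-issue count approaches half of that.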
Instruction Sets
• Maximize use of underlying hardware
    • Increase instruction efficiency: complex instructions, many different operations/accesses per call.
• Minimize amount of memory used
    • Instructions must be short; restrict flexibility such as register choice and multiple operation connections.
    • DSPs have fewer/smaller registers, use mode bits to morph some operations, and have highly individualized and irregular instruction sets.
• You can compile C code into a DSP target, but for efficient code it MUST BE HAND OPTIMIZED.
Lots of I/O
• A larger variety and amount of I/O than a microprocessor
• Specialized instruction set and hardware to deal with fast off-chip memory access, such as DMA
GPP exceptions
• General Purpose Processors have fought back because of the huge market that DSPs were beginning to encroach on
    • MMX (Pentium)
    • SSE (Pentium)
    • SH-2 (Strong Arm)
    • Power PC (AltiVec)
    • UltraSPARC (VIS, Visual Instruction Set)
Strange? Isn't this what CRAY was saying, that vector processors were the most powerful architecture?
Pentium 266 MMX Versus TMS320C62x
• 4x More power
• 1/3 MIPS
• 1/3 256-FFT completion time
• Same price
• 4x Die Size
• Pentium needs extensive cooling
Modulo Addressing
• Modulo addressing: implementing circular buffers and delay lines

Data-shifting
Time   | Buffer contents                                 | Next sample
n=N    | x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]           | x[N+1]
n=N+1  | x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]         | x[N+2]
n=N+2  | x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]         | x[N+3]

Modulo addressing
Time   | Buffer contents                                                   | Next sample
n=N    | ...  x[N-2]  x[N-1]  x[N]  x[N-K+1]  x[N-K+2]  ...                | x[N+1]
n=N+1  | ...  x[N-2]  x[N-1]  x[N]  x[N+1]  x[N-K+2]  x[N-K+3]  ...        | x[N+2]
n=N+2  | ...  x[N-2]  x[N-1]  x[N]  x[N+1]  x[N+2]  x[N-K+3]  x[N-K+4] ... | x[N+3]
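The two buffer strategies in the tables above can be sketched as follows. The class names are assumptions for illustration. The shifting version moves every stored sample on each update, while the circular version overwrites only the oldest sample and advances a pointer modulo the buffer length.

```python
class ShiftingDelayLine:
    """Delay line that physically shifts all samples on every update."""

    def __init__(self, length):
        self.buf = [0] * length

    def push(self, sample):
        # Every stored sample moves one place; the oldest falls off the front.
        self.buf = self.buf[1:] + [sample]

    def newest_first(self):
        return self.buf[::-1]


class CircularDelayLine:
    """Delay line using modulo addressing: one write per update."""

    def __init__(self, length):
        self.buf = [0] * length
        self.pos = 0  # index of the oldest sample, overwritten next

    def push(self, sample):
        self.buf[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.buf)

    def newest_first(self):
        k = len(self.buf)
        return [self.buf[(self.pos - 1 - i) % k] for i in range(k)]


shifting, circular = ShiftingDelayLine(3), CircularDelayLine(3)
for s in [1, 2, 3, 4, 5]:
    shifting.push(s)
    circular.push(s)
print(shifting.newest_first(), circular.newest_first(), circular.buf)
```

Both lines present the same samples to the filter, but the circular buffer's storage is not in time order, which is why hardware modulo addressing is needed to read it cheaply.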
DSP Characteristics
• Arithmetic Format
• Bus Width
• Speed
• Memory/Bus/Instruction architecture
• Development Tools
• Power Consumption
• Cost
• Specialized Hardware
Arithmetic
• Fixed Point or Floating Point?
    • Fixed: numbers are integers in a set range
    • Float: numbers are represented by a mantissa and exponent
• Fixed: cheaper, higher volume, faster, less power, horrible amounts of time tweaking and rescaling at different points in a calculation. 95% of DSP Market.
• Float: wider dynamic range, larger die size, easier, becoming more available. 5% of DSP Market.
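The "tweaking and rescaling" cost of fixed point shows up even in a single multiply. The sketch below assumes the common Q15 format (16-bit signed values representing fractions in [-1, 1)). The helper names are hypothetical, and real DSPs perform the rounding and saturation in hardware.

```python
Q15_ONE = 1 << 15
Q15_MAX = Q15_ONE - 1   # 32767, largest representable value
Q15_MIN = -Q15_ONE      # -32768, most negative value


def saturate(value):
    """Clamp to the 16-bit signed range instead of wrapping around."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x):
    """Convert a real number to Q15, saturating at the range limits."""
    return saturate(int(round(x * Q15_ONE)))


def from_q15(q):
    """Convert a Q15 integer back to a real number."""
    return q / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 values: the raw product is Q30, so add half an
    LSB for rounding and shift right by 15 to rescale back to Q15."""
    return saturate((a * b + (1 << 14)) >> 15)


half = to_q15(0.5)
print(half, q15_mul(half, half), from_q15(q15_mul(half, half)))
```

Note that 1.0 itself is not representable and saturates to 32767, and that (-1) x (-1) overflows the format and must be clamped. A floating point unit handles both cases without programmer effort.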
Bus Widths
• Fixed: usually 16 bit data bus
• Float: 32 bit, standard IEEE mantissa-exponent format
• Motorola DSP56300 family is a widely used, notable exception; it's 24 bit fixed point.
    • Almost the de facto standard for audio processing applications. Why? Think about the dynamic range of the auditory system: your ear has about 120 dB of dynamic range.
So w/ linear, uniform coding @ 16 bits and 24 bits:
    10^(120/20)/(2^16) ≈ 15.26
    10^(120/20)/(2^24) ≈ 0.0596
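The two ratios above can be reproduced with a short calculation. This is a sketch, with function names chosen for illustration. It also computes the dynamic range a uniform quantizer covers, using the standard 20*log10(2^bits) relation, which is roughly 6.02 dB per bit.

```python
import math


def db_to_amplitude_ratio(db):
    """Convert a dynamic range in dB to an amplitude ratio."""
    return 10 ** (db / 20)


def ratio_per_step(dynamic_range_db, bits):
    """Amplitude ratio to be covered, divided by the number of uniform levels."""
    return db_to_amplitude_ratio(dynamic_range_db) / 2 ** bits


def quantizer_dynamic_range_db(bits):
    """Dynamic range spanned by a uniform quantizer with 2**bits levels."""
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(bits, ratio_per_step(120, bits), quantizer_dynamic_range_db(bits))
```

A result above 1 (16 bits) means the 120 dB range of the ear exceeds what the word length can resolve, while a result below 1 (24 bits) means the word length covers it with headroom to spare.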
Speed
• Specmanship has inundated all aspects of silicon specification, so beware
    • MHz: What is the on-chip clock speed?
    • MIPS: Millions of Instructions Per Second, the reciprocal of the fastest instruction's time divided by 10^6.
    • MMACS: Millions of Multiply-Accumulates per Second.
    • Kernel Times: For specific tasks (256 point FIR, Radix-2 FFT), what is the absolute time?
Specmanship of Speed
* www.bdti.com, independent DSP benchmark results for the latest processors
Memory
• Most built around fast bus architecture
    • Harvard architecture splits program and data into separate memories, each with its own address and data buses (versus von Neumann, where they share one memory)
• Instruction cache supplies instructions, freeing up the bus for data fetches or writes.
• Embedded systems have smaller memory needs
• Variable instruction sizes and memory sizes
Development Tools
• S/W Tools: assemblers, linkers, simulators, debuggers, compilers, code libraries, RTOS
    • DSPs are compiler unfriendly: unusual and complex instruction sets. C/Ada produce bloated code; the intricacies of number crunching are almost always coded in Assembler. Floating point processors usually compile cleaner than fixed point.
• H/W Tools: emulators, development boards
    • JTAG: IEEE 1149.1, on chip debugging and emulation. Scan based emulation: set breakpoints like a S/W IDE, poll and set registers while paused.
System Management
• Minimizing Vcc to reduce power consumption
• Sleep modes
    • Turn off entire sections of the chip, e.g. the interface for an unconnected protocol
    • Event activation with different latencies, e.g. packet datacom: doesn't decode a packet unless the device address is pinged
• Programmable on-chip clock distribution
    • Clock Dividers for integer differences that arise in digital communication receivers
    • Phase-Locked-Loops (PLLs) for fine control over jitter and frequency
COST!!
• Limiting factor of any REAL design
• Packaging can be 50% of real cost (product plus manufacturing). Many companies are going to BGA (Ball Grid Array) packs versus P/T QFP (Plastic/Thin Quad Flat Pack), making them more expensive and IMPOSSIBLE to rework.
Analog Devices: ADSP-2116x SHARC
• Has special I/O and instructions that accelerate multiprocessor connections
    • 6 processors strung together with bus arbitration
    • Any processor can access the internal memory of any other processor
• Also replicates the entire operational block, giving you two powerful processors and making extensive use of SIMD (more on this later).
Low Range DSPs
• Analog Devices
    • ADSP-210x
• Motorola
    • DSP-560xx
• Texas Instruments
    • TMS320F28x
~40 MHz Clock, usually used as a souped up microcontroller.
Disk drives, cordless phones, ISM band equipment
Mid Range DSPs
• Analog Devices
    • ADSP-218x
• Motorola
    • DSP-563xx
• Texas Instruments
    • TMS320C52x
150 MHz, cell-phones, modems.
Very Long Instruction Word
• TI TMS320C62xx: first VLIW DSP
• VLIW uses simple, orthogonal, RISC based instruction sets. String several 4, 8 or 16 bit instructions together that use different parts of the H/W to execute every cycle.
• Compiles cleaner because of simpler instruction sets, but hand-optimization is harder because of heuristic scheduling for the H/W components.
TMS320C62xx
One instruction is fed into two sets of four execution units.
Instead of the MAC-ALU serial structure you have them in parallel, meaning each top-down operation is less complex, but may take more instructions.
VLIW v Superscalar
• VLIW produces code AT COMPILATION that identifies which instructions are completed in parallel
• Superscalar hardware identifies AT EXECUTION which instructions are completed in parallel
    • !! That means that for different iterations through a loop, a different order of instructions could be completed. Unusual processing times.
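As a toy illustration of compile-time grouping, the sketch below packs an in-order instruction list into fixed-width bundles, starting a new bundle whenever an instruction touches a register written earlier in the same bundle. Everything here (the tuple format, register names, and the greedy rule) is an assumption made for illustration; real VLIW compilers also reorder instructions and track functional-unit constraints.

```python
def schedule_bundles(instructions, width):
    """Greedy in-order packing of instructions into parallel bundles.

    `instructions` is a list of (name, dest_register, source_registers).
    An instruction may not share a bundle with an earlier instruction
    whose result it reads or overwrites.
    """
    bundles, current, written = [], [], set()
    for name, dest, sources in instructions:
        conflict = dest in written or any(s in written for s in sources)
        if current and (len(current) >= width or conflict):
            bundles.append(current)
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("load_x", "r1", []),
    ("load_h", "r2", []),
    ("mul", "r3", ["r1", "r2"]),
    ("load_x2", "r4", []),
    ("add", "r5", ["r3", "r4"]),
]
print(schedule_bundles(program, 2))
```

Because the grouping is fixed before the program runs, every pass through a loop executes the same bundles in the same order, which is the predictability that superscalar hardware gives up.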
Single-Instruction Multiple Data
• Instead of splitting instructions, splits operational blocks. A 16 bit MAC turns into two 8 bit MACs.
• Allows a processor to execute multiple instances of the same operation using different data.
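The "one 16 bit MAC becomes two 8 bit MACs" idea can be modelled by treating a 16-bit word as two unsigned 8-bit lanes. This is a sketch only: the helper names are hypothetical, and the lane accumulators are kept as ordinary Python integers rather than fixed-width registers.

```python
def pack2x8(hi, lo):
    """Pack two unsigned 8-bit values into one 16-bit word."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)


def unpack2x8(word):
    """Split a 16-bit word into its (high, low) 8-bit lanes."""
    return (word >> 8) & 0xFF, word & 0xFF


def simd_mac2x8(acc, a_word, b_word):
    """One SIMD MAC: the same multiply-accumulate applied to both lanes.

    `acc` is a pair (acc_hi, acc_lo), one accumulator per lane.
    """
    a_hi, a_lo = unpack2x8(a_word)
    b_hi, b_lo = unpack2x8(b_word)
    return acc[0] + a_hi * b_hi, acc[1] + a_lo * b_lo


a = pack2x8(3, 4)
b = pack2x8(5, 6)
print(hex(a), simd_mac2x8((0, 0), a, b))  # lanes: 3*5 and 4*6
```

One instruction, one pair of operand words, two independent results: that is the trade of precision for throughput that SIMD offers.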
Choose Your Own Adventure
• What DSP code looks like
• DSP Devices that you might be working with
• Short introduction to DSP on video cards
• MMX/SSE overview
• Reading DSP spec sheets
FIR Filters with Assembler
MOT DSP563xx
main()
{
    /* Control logic system setup and whatnot
       .......................................... */

    // Begin with an assembler call
    // Cycle counts listed on the slide: (2) (1) (5) (N) (1) (1)
    asm
    {
        move   #AADDR,r0                          // Register r0 load, will contain coeffs
        move   #BADDR,r4                          // Register r4 load, will contain data
        move   #N-1,m4                            // Load loop control
        move   m4,m0                              // move loop control
        movep  y:input,y:(r4)                     // move peripheral data from Input "y"
        clr a  x:(r0)+,x0  y:(r4)-,y0             // clear accumulator, memory moves
        rep    #N-1                               // Repeat next instruction
        mac    x0,y0,a  x:(r0)+,x0  y:(r4)-,y0    // Multiply Accumulate, update registers
        macr   c0,y0,a  (r4)+                     // Rounding and scaling (set by c0)
        movep  a,y:output                         // move accumulator output to peripheral "y"
    }
    // End assembler call

    /* Control logic system setup and whatnot
       .......................................... */
}
Differences in Assembler codes
This is from the LSI website, and in my mind, one of the reasons why they have lost some market share.

main:
    bits  %fmode, 2         /* Enable Q15 */
    lda   r13, Xdata
    lda   r15, Dbuffer
    lda   r11, Yout
    mov   r10, 40           /* Filter size, Nlen = 40 $$$ */
    mov   r9, 200           /* Input data size (Nsamp = 200) $$$ */
    mov   %cb1_beg, r15
    mov   r8, r10           /* r8 = Nlen */
    add   r8, 1             /* r8 = Nlen+1 */
    add   r10, -1           /* Adjust for loop counter */
    add   r8, r15
    mov   %cb1_end, r8      /* CB size = Nlen+1 */
    bits  %smode, 2         /* Enable CB1 (for r15) */
    mov   r6, 10000
    mov   %timer0, r6       /* Initialize Timer count */
                            /* Worst case cycle count = */
                            /* (Nlen + 6)*Nsamp */
per_sample:
    ldu   r7, r13, 1        /* "Acquire" new sample from "Xdata",*/
                            /* a pre-stored input buffer -- in a */
                            /* real-time application, this new */
                            /* sample may come from a different */
                            /* task or an external device, etc. */
    mov   %loop0, r10
    lda   r14, Hfilter
    psub.a r0, r0           /* Clear accumulator's 32-bits */
    st    r7, r15           /* Store new sample into Dbuffer */
    mov   %guard, 0         /* Clear Guard bits */
    bits  %tc, 7            /* Timer0 starts ticking */
fir_loop:
    ldu   r4, r14, 1        /* Filter coefficient */
    ldu   r2, r15, 1        /* Sample from Data buffer (circular) */
    mac.a r2, r4
    agn0  fir_loop
    bitc  %tc, 7            /* Timer0 frozen */
    round.e r0, r0          /* Filter output is rounded */
    stu   r1, r11, 1        /* Filter output is stored */
flag1:
    nop
    add   r9, -1
    bnz   per_sample
    nop
filter_done:                /* Set an SDBUG break-point here */
    nop                     /* Note: ZSIM or RTL need a HALT here */
    nop
    br    filter_done
    nop
Analog Devices Overview
* From http://www.bdti.com
CHIPS, Vendor: Analog Devices

Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
ADSP-218x | Fixed point | 16 bits | 24 bits | 80 MHz | 240 | 20 K-256 K | 1.8 | $4-24 | Many family members w/ assorted peripherals
ADSP-219x | Fixed point | 16 bits | 24 bits | 160 MHz | 410 | 20 K-160 K | 2.5 | $10-24 | Enhanced version of the ADSP-218x
ADSP-2116x (SHARC) | Floating point | 32/40 bits | 48 bits | 100 MHz | 470 | 128 K-512 K | 1.8, 2.5 | $22-99 | Features SIMD, strong multiprocessor support
ADSP-BF53x (Blackfin) | Fixed point | 16 bits | 16/32 bits | 600 MHz | 3360 [5] | 84 K-308 K | 0.7-1.2, 1.0-1.6 | $6-35 | Dual-MAC DSP with variable speed and voltage
ADSP-TS20x (TigerSHARC) | Both | 8/16/32/40 bits | 32 bits | 600 MHz | 6150 [5] | 512 K-3 M | 1.0, 1.2 | $35-299 | 4-way VLIW with SIMD capabilities; uses eDRAM
Motorola Devices Overview
* From http://www.bdti.com
CHIPS, Vendor: Motorola

Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
DSP563xx | Fixed point | 24 bits | 24 bits | 240 MHz | 710 | 24 K-384 K | 1.5, 1.6, 1.8, 3.3 | $4-56 | PCI bus, DMA, can run 560xx code unmodified
DSP568xx | Fixed point | 16 bits | 16 bits | 40 MHz [6] | 110 | 28 K-152 K | 2.5, 3.3 | $3-15 | Contains many microcontroller-like features
DSP5685x | Fixed point | 16 bits | 16 bits | 120 MHz | 340 | 36 M | 1.8 | $6-12 | Enhanced version of the 568xx
MSC810x (SC140) | Fixed point | 16 bits | 16 bits | 300 MHz | 3370 [7] | 512 K-1436 K | 1.6 | $90-195 | Based on quad-MAC SC140 core; 8102 uses 4 cores
TI Devices Overview
* From http://www.bdti.com
CHIPS, Vendor: TI

Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
TMS320F24x | Fixed point | 16 bits | 16/32 bits | 40 MHz | n/a | 18 K-1120 K | 3.3, 5.0 | $3-15 | Hybrid microcontroller/DSP
TMS320F28x | Fixed point | 32 bits | 16/32 bits | 150 MHz | n/a | 164 K-292 K | 1.8 | $16-18 | Hybrid microcontroller/DSP; compatible w/ C24x
TMS320C3x | Floating point | 32 bits | 32 bits | 75 MHz [6] | n/a | 264 K-2304 K | 3.3, 5.0 | $10-213 | Cost-competitive with fixed point DSPs
TMS320C54x | Fixed point | 16 bits | 16 bits | 160 MHz | 500 | 24 K-1280 K | 1.5, 1.6, 1.8, 2.5, 3.3 | $4-109 | Many specialized instructions
TMS320C55x | Fixed point | 16 bits | 8-48 bits | 300 MHz | 1460 | 80 K-376 K | 1.26, 1.5, 1.6 | $5-20 | Next generation C5xxx architecture; dual-issue, dual-MAC DSP
TMS320C62x | Fixed point | 16 bits | 32 bits | 300 MHz | 1920 | 72 K-896 K | 1.5, 1.8 | $9-102 | 8-way VLIW
TMS320C64x | Fixed point | 8/16 bits | 32 bits | 720 MHz | 6570 | 288 K-1056 K | 1.0, 1.2, 1.4 | $39-277 | Next generation C6xxx architecture
TMS320C67x | Floating point | 32 bits | 32 bits | 225 MHz | 1100 | 64 K-264 K | 1.2, 1.26, 1.8, 1.9 | $14-110 | Floating point version of C62x
Cores versus Chips
NVidia NV3x Video Card Core
[Figure: NVIDIA GeForce FX 5900 block diagram. Callouts: Cut input into little quads; Interpolator; Programmable DSP Core; Different Units for different processes; Fusing and smoothing.]
NV3x Guts
MMX versus SSE
• MMX: 57 new processor instructions (Pentium MMX, Pentium II)
    • MMX = MultiMedia eXtensions
    • SIMD for integers
    • MMX instructions operate on two 32-bit integers simultaneously
• SSE: 70 new processor instructions and subtle architecture differences for the Pentium III and later
    • SSE = Streaming SIMD Extensions
    • The Pentium III introduction did not follow Moore's law on clock speed, but did on most operations because of SSE
    • SIMD for single-precision floating-point numbers
    • SSE instructions operate on four 32-bit floats simultaneously.
SSE Architecture Changes
• New registers, each is 128 bits long and can hold four single-precision (32 bit) floating-point numbers
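To picture the 128-bit register layout, the sketch below packs four single-precision floats into 16 bytes and adds two such "registers" lane by lane, which is what the SSE ADDPS (add packed single-precision) instruction does in hardware. The function names `pack_xmm` and `addps_model` are assumptions for illustration. This is a software model, not a way to execute SSE instructions from Python.

```python
import struct


def pack_xmm(a, b, c, d):
    """Pack four single-precision floats into 16 bytes (128 bits)."""
    return struct.pack("<4f", a, b, c, d)


def unpack_xmm(reg):
    """Return the four 32-bit float lanes of a 128-bit value."""
    return struct.unpack("<4f", reg)


def addps_model(x, y):
    """Lane-wise add of two 128-bit values, modelling one packed add."""
    lanes = (p + q for p, q in zip(unpack_xmm(x), unpack_xmm(y)))
    return struct.pack("<4f", *lanes)


x = pack_xmm(1.0, 2.0, 3.0, 4.0)
y = pack_xmm(0.5, 0.5, 0.5, 0.5)
print(len(x) * 8, unpack_xmm(addps_model(x, y)))
```

Four additions complete with one instruction, which is where the speedup on vector and matrix workloads comes from.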
SSE Advantages
• An application cannot execute MMX instructions and perform floating-point operations simultaneously.
• Operations accelerated with SSE instructions are matrix multiplication, matrix transposition, matrix-matrix operations like addition, subtraction, and multiplication, matrix-vector multiplication, vector normalization, vector dot product, and lighting calculations.
MMX Benchmark
Deependra Talla and Lizy K. John (1999), "Performance Evaluation and Benchmarking of Native Signal Processing," European Conference on Parallel Processing.
ADSP-TS20x TigerSHARC
VLIW and SIMD: Split one instruction between two units (VLIW), and each of those units can split its part of the instruction into sub units. In this example we can see one uber-instruction can call 8 16-bit multiplies.
* Walkthrough of ADSP-TS201 Spec Sheet
Motorola DSP56367
• Walkthrough of Spec Sheet
Texas Instruments
TMS320VC5421
• Spec Sheet Walkthrough