INTRODUCTION TO DIGITAL SIGNAL PROCESSORS
Prof. Brian L. Evans
in collaboration with
Niranjan Damera-Venkata and
Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084
http://signal.ece.utexas.edu/
[Figure: accumulator, memory-register, and load-store architectures]
Outline
Signal processing applications
Conventional DSP architecture
Pipelining in DSP processors
RISC vs. DSP processor architectures
TI TMS320C6x VLIW DSP architecture
Signal and image processing applications
Signal processing on general-purpose processors
Conclusion
Signal Processing Applications
Low-cost embedded systems
Modems, cellular telephones, disk drives, printers
High-throughput applications
Halftoning, base stations, 3-D sonar, tomography
PC-based multimedia
Compression/decompression of audio, graphics, video
Embedded processor requirements
Inexpensive with small area and volume
Deterministic interrupt service routine latency
Low power: ~50 mW (TMS320C54x uses 0.36 mA/MIP)
Conventional DSP Architecture
Harvard architecture
Separate data memory/bus and program memory/bus
Three reads and one or two writes per instruction cycle
Deterministic interrupt service routine latency
Multiply-accumulate in single instruction cycle
Special addressing modes supported in hardware
Modulo addressing for circular buffers (e.g. FIR filters)
Bit-reversed addressing (e.g. fast Fourier transforms)
Instructions to keep the pipeline (3-4 stages) full
Zero-overhead looping (one pipeline flush to set up)
Delayed branches
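Bit-reversed addressing can be illustrated in software. The following is a minimal sketch, not any processor's implementation: it computes the order in which a radix-2 FFT of length n reads or writes its samples. The function names are hypothetical, and a DSP produces the same sequence in its address-generation hardware rather than with a loop.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the bit-reversed index order for a block of n samples.

    n must be a power of two, as in a radix-2 FFT.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    num_bits = n.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

With hardware support the reordering costs no extra instructions, which is the point of providing the addressing mode.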
Conventional DSP Architecture (cont)
Modulo addressing for implementing circular buffers and delay lines
[Figure: buffer contents and next sample at times n=N, N+1, N+2. Data shifting: the buffer holds x(N-K+1) ... x(N), and every sample moves one position when x(N+1) arrives. Modulo addressing: samples stay in place, and the next sample overwrites the oldest one.]
Bit-reversed addressing used to implement the radix-2 FFT
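The contrast between data shifting and modulo addressing can be sketched in a few lines. This is an illustrative model only: the buffer length K, the class name, and the method names are assumptions. A DSP performs the modulo address update in hardware, at no cost in instructions.

```python
K = 4  # delay-line length (hypothetical value for illustration)


def shift_insert(buffer, sample):
    """Data shifting: every stored sample moves one slot, then the new one is appended."""
    return buffer[1:] + [sample]


class CircularBuffer:
    """Modulo addressing: samples stay put; only the write index moves."""

    def __init__(self, size):
        self.data = [0] * size
        self.newest = size - 1  # index of the most recent sample

    def insert(self, sample):
        self.newest = (self.newest + 1) % len(self.data)  # modulo address update
        self.data[self.newest] = sample  # overwrites the oldest sample

    def tap(self, delay):
        """Return the sample inserted `delay` steps ago (0 is the newest)."""
        return self.data[(self.newest - delay) % len(self.data)]


shifted = [0] * K
circular = CircularBuffer(K)
for x in [1, 2, 3, 4, 5, 6]:
    shifted = shift_insert(shifted, x)
    circular.insert(x)

print(shifted)        # [3, 4, 5, 6]
print(circular.data)  # [5, 6, 3, 4]
```

Both buffers hold the same K most recent samples. The shifting version moves K-1 values per new sample, while the circular version writes exactly one.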
Conventional DSP Architecture (cont)
                 Fixed-Point                          Floating-Point
Cost/Unit        $5 - $79                             $5 - $381
Architecture     Accumulator                          load-store or memory-register
Registers        2-4 data                             8 or 16 data
                 8 address                            8 or 16 address
Data Words       16 or 24 bit integer                 32 bit integer and
                 and fixed-point                      fixed/floating-point
On-Chip Memory   2-64 kwords data                     8-64 kwords data
                 2-64 kwords program                  8-64 kwords program
Address Space    16-128 kw data                       16 Mw - 4 Gw data
                 16-64 kw program                     16 Mw - 4 Gw program
Compilers        C compilers;                         C, C++ compilers;
                 poor code generation                 better code generation
Examples         TI TMS320C5x;                        TI TMS320C3x;
                 Motorola 56000                       Analog Devices SHARC
Conventional DSP Architecture (cont)
Market share: 95% fixed-point, 5% floating-point
Each processor family has dozens of members with
different on-chip configurations
Size and map of data and program memory
A/D, input/output buffers, interfaces, timers, and D/A
Drawbacks to conventional DSP processors
No byte addressing (needed for image and video)
Limited on-chip memory
Limited addressable memory on fixed-point DSPs, except
Motorola 56300 (16 Mw data; 64 Mw program)
Non-standard C extensions to support fixed-point data
Pipelining
Sequential (Motorola 56000)
Pipelined (Most conventional DSP processors)
Superscalar (Pentium, MIPS)
Superpipelined (CDC7600)
[Figure: Fetch, Decode, Read, and Execute stage timing for each of the four organizations]
Managing Pipelines
compiler or programmer
pipeline interlocking in the processor
hardware instruction scheduling
Pipelining: Operation
Time-stationary pipeline model
Programmer controls each cycle
Motorola DSP56001
MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0
Data-stationary pipeline model
Programmer specifies data operations
TMS320C30/40
MPYF *++AR0(1),*++AR1(IR0),R0
Interlocked pipeline
Programmer is protected from pipeline effects
[Figure: pipeline occupancy chart. Instructions A through L advance one stage per cycle through Fetch, Decode, Read, and Execute.]
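A pipeline occupancy chart of this kind can be generated with a short script. This is a minimal sketch of an ideal four-stage pipeline with no stalls or branches. The stage names come from the slides; the function name and the instruction labels are illustrative assumptions.

```python
STAGES = ["Fetch", "Decode", "Read", "Execute"]


def occupancy(instructions, num_cycles):
    """Return {stage: [label per cycle]} for an ideal pipeline with no stalls.

    An instruction enters Fetch at the cycle equal to its position in the
    program and moves one stage deeper on every following cycle.
    """
    chart = {}
    for depth, stage in enumerate(STAGES):
        row = []
        for cycle in range(num_cycles):
            idx = cycle - depth  # which instruction sits in this stage now
            row.append(instructions[idx] if 0 <= idx < len(instructions) else "-")
        chart[stage] = row
    return chart


chart = occupancy(list("ABCDEF"), 9)
for stage in STAGES:
    print(f"{stage:8s}", " ".join(chart[stage]))
```

Reading one column of the printout gives the state of the whole pipeline in that cycle. At cycle 3, for instance, D is being fetched while A executes.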
Pipelining: Hazards
A control hazard occurs when a
branch instruction is decoded
Flush the pipeline
or: Delayed branch (expose pipeline)
A data hazard occurs because
an operand cannot be read yet
Intended by programmer
or: Interlock hardware inserts bubble
TMS320C5x example
LAC #064h
SAMM AR2
NOP
LACC *-
LAR AR2, DATA
LACC *-
[Figure: Fetch/Decode/Read/Execute occupancy charts. Control hazard: the pipeline empties behind the branch instruction br. Data hazard: instructions X, Y, and Z with inserted bubbles.]
Pipelining: Avoiding Control Hazards
A key factor in the numeric performance of DSPs is the provision of special hardware to perform looping.
RPT COUNT
TBLR *+
A repeat instruction repeats the instruction, or block of instructions, that follows it
The pipeline is filled with the repeated instruction (or block of instructions)
Cost: one pipeline flush only
[Figure: Fetch/Decode/Read/Execute occupancy chart. After rpt passes through the pipeline, every stage holds the repeated instruction X.]
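The benefit of hardware looping can be estimated with a simple cycle count. The model below is a rough sketch under stated assumptions, not measured data. A software loop is charged a counter update, a branch, and a pipeline flush on every iteration. A repeat instruction is charged one instruction and one flush in total. All parameter values are hypothetical.

```python
def cycles_branch_loop(iterations, body_cycles, flush_penalty, overhead_instrs=2):
    """Software loop: body plus counter update and branch, plus a flush, per iteration."""
    return iterations * (body_cycles + overhead_instrs + flush_penalty)


def cycles_repeat_loop(iterations, body_cycles, flush_penalty):
    """Hardware repeat: one repeat instruction and one flush, then only the body."""
    return 1 + flush_penalty + iterations * body_cycles


# Hypothetical numbers: 128 iterations of a 1-cycle body, 3-cycle flush.
print(cycles_branch_loop(128, 1, 3))  # 768
print(cycles_repeat_loop(128, 1, 3))  # 132
```

For single-instruction loop bodies, which are typical of filter kernels, the loop overhead dominates the software version.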
RISC vs. DSP: Instruction Encoding
RISC: Superscalar
[Figure: Reorder, Load/store, FP Unit, Integer Unit]
DSP: Horizontal microcode
[Figure: instruction fields for Load/store, Load/store, ALU, Multiplier, Address]
RISC vs. DSP: Memory Hierarchy
RISC
[Figure: registers, out-of-order execution, I/D cache, TLB, physical memory]
TLB: Translation Lookaside Buffer
DSP
[Figure: registers, I cache, internal memories, external memories, DMA controller]
DMA: Direct Memory Access
TI TMS320C6x VLIW DSP Architecture
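[Figure: simplified architecture. The CPU contains two data paths: units .D1, .M1, .L1, .S1 with registers A0-A15, and units .D2, .M2, .L2, .S2 with registers B0-B15, plus control registers. Internal buses (address and data) connect the CPU to program RAM or cache, data RAM, DMA, external memory (sync and async), serial port, host port, boot load, timers, and power down.]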
TI TMS320C6x VLIW DSP Architecture
Two parallel data paths with single-cycle units:
Data unit - 32-bit address calculations (modulo, linear)
Multiplier unit - 16 bit x 16 bit with 32-bit result
Logical unit - 40-bit (saturation) arithmetic & compares
Shifter unit - 32-bit integer ALU and 40-bit shifter
16 32-bit registers in each data path
40 bits can be stored in adjacent even/odd registers
Fixed-point (C62x) and floating-point (C67x)
TMS320C6201: $25 in volume
150 MHz, 300 million MACs/sec, 1200 RISC MIPS
On-chip memory: 16 k x 32 program, 32 k x 16 data
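The peak figures quoted for the TMS320C6201 follow from the unit counts given above. The short calculation below is a sanity check rather than a benchmark. It assumes that every functional unit issues one operation per clock cycle and that each data path has one multiplier, as the unit list suggests. The variable names are illustrative.

```python
CLOCK_MHZ = 150
DATA_PATHS = 2
UNITS_PER_PATH = 4        # data, multiplier, logical, and shifter units
MULTIPLIERS_PER_PATH = 1

# Peak rates, assuming every unit is busy on every cycle.
peak_mips = CLOCK_MHZ * DATA_PATHS * UNITS_PER_PATH
peak_mmacs = CLOCK_MHZ * DATA_PATHS * MULTIPLIERS_PER_PATH

# On-chip memory converted to bytes.
program_bytes = 16 * 1024 * 32 // 8
data_bytes = 32 * 1024 * 16 // 8

print(peak_mips, peak_mmacs)      # 1200 300
print(program_bytes, data_bytes)  # 65536 65536
```

Sustained rates are lower whenever the compiler or programmer cannot keep all eight units busy.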
TI TMS320C6x VLIW DSP Architecture
One instruction cycle every clock cycle
Deep pipeline
7-11 stages in C62x: fetch 4, decode 2, execute 1-5
7-16 stages in C67x: fetch 4, decode 2, execute 1-10
If a branch is in the pipeline, interrupts are disabled (the latency
of a branch is 5 cycles)
Avoid branches by using conditional execution
No hardware protection against pipeline hazards
Compiler and assembler must prevent pipeline hazards
C67x computes floating-point multiply in 4 cycles
C5x and C6x Addressing Modes
Immediate: the operand is part of the instruction
    TMS320C5x: ADD #0FFh
    TMS320C6x: add .L1 -13,A1,A6
Register: the operand is specified in a register
    TMS320C5x: (implied)
    TMS320C6x: add .L1 A7,A6,A7
Direct: the address of the operand is part of the instruction (added to the implied memory page)
    TMS320C5x: ADD 010h
    TMS320C6x: not supported
Indirect: the address of the operand is stored in a register
    TMS320C5x: ADD *
    TMS320C6x: ldw .D1 *A5++[8],A1
TMS320C6x vs. Pentium MMX
Processor         Peak MIPS  BDTImarks  ISR latency  Power   Unit price  Area       Volume
Pentium MMX 233   466        49         1.14 µs      4.25 W  $213        5.5 x 2.5  8.789 in^3
Pentium MMX 266   532        56         1.00 µs      4.85 W  $348        5.5 x 2.5  8.789 in^3
C62x 150 MHz      1200       74         0.12 µs      1.45 W  $25         1.3 x 1.3  0.118 in^3
C62x 200 MHz      1600       99         0.09 µs      1.94 W  $96         1.3 x 1.3  0.118 in^3
BDTImarks: Berkeley Design Technology Inc. DSP benchmark
results (larger means better) http://www.bdti.com/bdtimark/results.htm
http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html
Application: FIR Filter
[Figure: tapped delay line built from z^-1 delay elements]
Each tap requires
Fetching one data sample
Fetching one operand
Multiplying two numbers
Accumulating multiplication result
Shifting one sample in the delay line
Computing an FIR tap in one instruction cycle
Three data memory accesses
Auto-increment or decrement addressing modes
Modulo addressing to implement delay line as circular buffer
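The per-tap operations listed above map directly onto a direct-form FIR loop. The sketch below is a plain software model, not DSP code. The delay line is a circular buffer indexed with modulo arithmetic, and each pass through the inner loop is one multiply-accumulate. The function name and test values are illustrative assumptions.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR filter using a circular delay line.

    coeffs[0] multiplies the newest sample, coeffs[-1] the oldest.
    """
    k = len(coeffs)
    delay = [0.0] * k  # delay line, initially cleared
    newest = 0         # index of the most recent sample
    out = []
    for x in samples:
        newest = (newest + 1) % k  # modulo address update
        delay[newest] = x          # overwrite the oldest sample
        acc = 0.0
        idx = newest
        for h in coeffs:           # one multiply-accumulate per tap
            acc += h * delay[idx]
            idx = (idx - 1) % k    # step back through older samples
        out.append(acc)
    return out


print(fir_filter([0.5, 0.25, 0.25], [4, 0, 0, 8]))  # [2.0, 1.0, 1.0, 4.0]
```

Feeding an impulse of height 4 reproduces the coefficients scaled by 4, which is a quick way to check the tap ordering.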
Application: FIR Filter on a TMS320C5x
[Figure: coefficient and data memory layout]
COEFFP  .set  02000h         ; Program mem address
X       .set  037Fh          ; Newest data sample
LASTAP  .set  037FH          ; Oldest data sample
        LAR   AR3, #LASTAP   ; Point to oldest sample
        RPT   #127
        MACD  COEFFP, *-     ; Do the thing
        APAC
        SACH  Y,1            ; Store result -- note shift
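The shift on the final store reflects fixed-point scaling. The model below is a sketch that assumes Q15 operands, that is, 16-bit fractions scaled by 2^15. The helper names are hypothetical, and accumulator overflow is ignored. Each product of two Q15 values is a Q30 number, so shifting the accumulator left by one bit and keeping the upper 16 bits of the 32-bit result yields a Q15 output.

```python
def to_q15(x):
    """Convert a fraction in [-1, 1) to a 16-bit Q15 integer, with clamping."""
    return max(-32768, min(32767, int(round(x * 32768))))


def mac_q15(coeffs_q15, data_q15):
    """Accumulate Q15 x Q15 products; the sum is in Q30 format.

    Python integers do not overflow, so a real 32-bit accumulator is
    only modeled faithfully while the sum stays in range.
    """
    acc = 0
    for h, x in zip(coeffs_q15, data_q15):
        acc += h * x
    return acc


def store_high(acc, shift=1):
    """Shift the accumulator left, then keep the upper 16 of 32 bits."""
    return (acc << shift) >> 16


coeffs = [to_q15(0.5), to_q15(0.25)]
data = [to_q15(0.5), to_q15(0.5)]
y = store_high(mac_q15(coeffs, data))
print(y, y / 32768)  # 12288 0.375
```

Without the one-bit shift the stored value would be half the intended result, because the Q30 sum carries a redundant sign bit.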
Application: FIR Filter on a TMS320C62x
[Figure: coefficient and data memory layout]
Single-Cycle Loop
        ...
c7:        ldh  .D1   *A1++, A2    ; Read coefficient
||         ldh  .D2   *B1++, B2    ; Read data
|| [B0]    sub  .L2   B0, 1, B0    ; Decrement counter
|| [B0]    B    .S2   c7           ; Branch if not zero
||         mpy  .M1x  A2, B2, A3   ; Form product
||         add  .L1   A4, A3, A4   ; Accumulate result
        ...
Ordered Dithering on a TMS320C62x
Periodic array of thresholds: 1/8 5/8 7/8 3/8 7/8 3/8 1/8 5/8
Throughput of two cycles
; remove next two lines if thresholds in linear array
        MVK     .S1   0x0001,AMR   ; modulo block size 2^2
        MVKH    .S1   0x4000,AMR   ; modulo addr reg B6
; initialize A6 and B6
        .trip 100                  ; minimum loop count
dith:   LDB     .D1   *A6++,A4     ; read pixel
||      LDB     .D2   *B6++,B4     ; read threshold
||      CMPGTU  .L1x  A4,B4,A1     ; threshold pixel
||      ZERO    .S1   A5           ; 0 if <= threshold
  [A1]  MVK     .S1   255,A5       ; 255 if > threshold
||      STB     .D1   A5,*A6++     ; store result
||[B0]  SUB     .L2   B0,1,B0      ; decrement counter
||[B0]  B       .S2   dith         ; branch if not zero
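The same thresholding can be expressed as a short high-level reference model. The sketch assumes 8-bit pixels and a one-dimensional period of four thresholds along a row, scaled from the fractions 1/8, 5/8, 7/8, 3/8 to the 0-255 range. The function name and the test row are illustrative.

```python
# Thresholds 1/8, 5/8, 7/8, 3/8 scaled by 256 (an assumed scaling for 8-bit pixels).
THRESHOLDS = [32, 160, 224, 96]


def ordered_dither_row(pixels, thresholds):
    """Threshold each pixel against a periodically repeated threshold array."""
    out = []
    for i, pixel in enumerate(pixels):
        t = thresholds[i % len(thresholds)]  # modulo addressing into the array
        out.append(255 if pixel > t else 0)  # 255 if > threshold, else 0
    return out


row = [100, 100, 100, 100, 200, 200, 200, 200]
print(ordered_dither_row(row, THRESHOLDS))
# [255, 0, 0, 255, 255, 255, 0, 255]
```

The index computation `i % len(thresholds)` plays the role that circular addressing of the threshold pointer plays in the assembly loop.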
DSP Cores
ASIC with:
Programmable DSP
RAM
ROM
Standard cells
Codec
Peripherals
Gate array
Microcontroller
DSP on General Purpose Processors
Multimedia applications on PCs
Video, audio, graphics and animation
Repetitive parallel sequences of instructions
Native signal processing examples
Sun Visual Instruction Set (UltraSPARC 1/2)
Intel MMX (Pentium I/II/III)
Intel Concurrent SIMD-FP (Pentium III)
Single Instruction Multiple Data (SIMD)
One instruction acts on multiple data in parallel
Well-suited for graphics
DSP on General Purpose Processors (cont)
Programming is considerably tougher
C/C++ compilers do not generate native signal processing code
except Metrowerks CodeWarrior 5 gives MMX code
Libraries of routines using native signal processing
Hand code using in-line assembly for best performance
Pack/unpack data not aligned on SIMD word boundaries
50-cycle penalty to switch to MMX; 0 penalty for VIS
Saturation arithmetic in MMX; not supported in VIS
Extended-precision accumulation in MMX; none in VIS
Speedup for applications
Signal and image processing - 1.5:1 to 2:1
Graphics - 4:1 to 6:1 (no packing/unpacking)
Intel MMX Instruction Set
64-bit SIMD register (4 data types)
64-bit quad word
Packed byte (8 bytes packed into 64 bits)
Packed word (4 16-bit words packed into 64 bits)
Packed double word (2 double words packed into 64 bits)
57 new instructions
Pack and unpack
Add, subtract, multiply, and multiply/accumulate
Saturation and wraparound arithmetic
Maximum parallelism possible
8:1 for 8-bit additions
4:1 for 8 x 16 multiplication or 16-bit additions
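The difference between saturation and wraparound arithmetic on packed data can be shown with a small model. This sketch treats a 64-bit register as a list of eight unsigned bytes and adds two such registers lane by lane. The function name is hypothetical. On MMX the two behaviors correspond roughly to the PADDUSB and PADDB instructions, each of which handles all eight lanes at once.

```python
def packed_add_bytes(a, b, saturate):
    """Add two lists of unsigned bytes lane by lane.

    With saturate=True, sums above 255 clamp to 255.
    With saturate=False, sums wrap around modulo 256.
    """
    result = []
    for x, y in zip(a, b):
        s = x + y
        result.append(min(s, 255) if saturate else s & 0xFF)
    return result


a = [250, 100, 0, 255, 128, 1, 200, 60]
b = [10, 100, 0, 1, 128, 1, 100, 4]
print(packed_add_bytes(a, b, saturate=True))   # [255, 200, 0, 255, 255, 2, 255, 64]
print(packed_add_bytes(a, b, saturate=False))  # [4, 200, 0, 0, 0, 2, 44, 64]
```

Saturation is usually what image and audio code wants, since a bright pixel that overflows should stay bright rather than wrap to dark.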
Concluding Remarks
Conventional digital signal processors
High performance vs. power consumption/cost/volume
Excel at one-dimensional processing
Per cycle: 1 16x16 MAC & 4 16-bit RISC instructions
TMS320C6x VLIW DSP
High performance vs. cost/volume
Excel at multidimensional signal processing
Per cycle: 2 16x16 MACs & 4 32-bit RISC instructions
Native Signal Processing
Available on desktop computers
Excels at graphics
Per cycle: 2 8x16 MACs OR 8 8-bit RISC instructions
In-line assembly code for best performance
Concluding Remarks
Digital signal processor market
40% annual growth rate since 1990
$3.5 billion revenue in 1998
45% TI, 25% Lucent, 10% Motorola, 8% Analog Devices
Independent benchmarking by industry
Berkeley Design Technology Inc. http://www.bdti.com
EDN Embedded Microprocessor Benchmark Consortium
http://www.eembc.org
Web resources
comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html
embedded processors and systems: www.eg3.com
on-line courses and DSP boards: www.techonline.com