INTRODUCTION TO A ccu m u l a t o r a r c h i t e c t u re
DIGITAL SIGNAL
PROCESSORS
M em o r y - r e g i s t e r a r c h itectu r e
P r of. B r i a n L . E v a n s
in collaboration w ith
N ir a n ja n D a m e r a -V e n k a t a a n d
M a g e s h Va llia p p a n L o a d - s t o r e a r c h itectu r e
E m b e d d e d S ign a l P r oces s in g L a b or a t or y
T h e U n i v e r s it y of T e x a s a t A u s t in
A u s t in , T X 7 8 7 1 2 - 1 0 8 4
h t t p ://s i g n a l.e c e .u t e x a s .e d u /
Outline
n S ign a l p r o c e s s i n g a p p lica t ion s
n C o n v e n t ion a l D S P a r ch it e ct u r e
n P i p e l i n in g i n D S P p r oce s s or s
n R I S C v s . D S P p r oce s s or a r ch it e ct u r e s
n T I T M S 3 2 0 C 6 x V L I W D S P a r ch it e ct u r e
n S ign a l a n d im a g e p r oce s s in g a p p lica t ion s
n S ign a l p r o c e s s i n g on g e n e r a l-p u r p o s e p r o c e s s o r s
n C o n clu s ion
2
Signal Processing Applications
n Low-cost em b e d d e d s y s t e m s
4 M o d e m s , cellu la r t e l e p h on e s , d i s k d r i v e s , p r i n t e r s
n H igh -t h r ou g h p u t a p p lica t ion s
4 H a lft on in g , ba s e s t a t ion s , 3 - D s o n a r , t o m ogr a p h y
n P C b a s e d m u lt im e d ia
4 C o m p r e s s ion /d e com p r e s s ion of a u d io, gr a p h ics , vid e o
n E m b e d d e d p r oce s s or r e q u ir e m e n t s
4 I n e x p e n s i v e w i t h s m a ll a r e a a n d volu m e
4 D e t e r m in i s t i c i n t e r r u p t s e r v ice r ou t in e l a t e n cy
4 L o w p o w e r : ~ 5 0 m W ( T M S 3 2 0 C 5 4 x u s e s 0 . 3 6 A / M I P )
3
Conventional DSP Architecture
n H a r v a r d a r ch it e ct u r e
4 S e p a r a t e d a t a m e m or y /b u s a n d p r ogr a m m e m or y /bu s
4 T h r e e r e a d s a n d on e or t w o w r i t e s p e r i n s t r u ct ion cycle
n D e t e r m in istic in t e r r u p t s e r v ice r ou t in e la t e n cy
n M u lt iply-a ccu m u la t e in s in g l e i n s t r u c t i o n c y c l e
n S p e cia l a d d r e s s in g m o d e s s u p p o r t e d i n h a r d w a r e
4 M o d u l o a d d r e s s i n g for cir cu la r b u ffe r s (e .g. F I R filt e r s )
4 B it -r e v e r s e d a d d r e s s in g (e.g. fa s t F ou r i e r t r a n s for m s )
n I n s t r u ct ion s t o k e e p t h e p i p e l i n e ( 3 - 4 s t a g e s ) f u l l
4 Z e r o-ove r h e a d loop in g (on e p i p e l i n e flu s h t o s e t u p )
4 D e la y e d b r a n ch e s
4
Conventional DSP Architecture (cont)
Data-shifting
n Modulo addressing Time Buffer contents Next sample
4 implementing n=N xN-K+1 xN-K+1 xN-1 xN xN+1
circular buffers and
n=N+1 xN-K+2 xN-K+3 xN xN+1 xN+2
delay lines
n=N+2 xN-K+3 xN-K+4 xN+1 xN+2 xN+3
Modulo addressing
Time Buffer contents Next sample
n Bit reversed
n=N xN-2 xN-1 xN xN-K+1 xN-K+2 xN+1
addressing
4 used to n=N+1 xN-2 xN-1 xN xN+1 xN-K+2 xN-K+3 xN+2
implement the
radix-2 FFT n=N+2 xN-2 xN-1 xNN xN+1 xN+2 xN-K+3 xxN-K+4
N-K+4 xN+3
5
Conventional DSP Architecture (cont)
F i x e d -P o i n t F l o a t i n g -P o i n t
C o s t /U n i t $5 - $79 $5 - $381
Architecture Accu m u la t or loa d -s t or e or
m e m or y -r e g ist e r
Registers 2-4 data 8 or 1 6 d a t a
8 address 8 or 1 6 a d d r e s s
Data Word s 1 6 o r 2 4 b it in t e g e r 32 bit in t e g e r a n d
a n d fixed -poin t fixed /floa t in g -poin t
On-Chip 2 -64 k w or d s d a t a 8 -64 k w or d s d a t a
Mem ory 2 -64 k w or d s p r ogr a m 8 -64 k w or d s p r ogr a m
A d d r ess 16-128 k w d a t a 16 Mw 4Gw data
S p a ce 1 6 -64 k w p r ogr a m 1 6 M w 4 G w p r ogr a m
Com p ilers C com p iler s ; C , C++ com p iler s ;
p o o r c o d e g e n e r a t ion b e t t e r cod e g e n e r a t ion
Exa m p les TI TMS320C5x; TI TMS320C3x;
Motorola 56000 An a log D e v ice s S H A R C
6
Conventional DSP Architecture (cont)
n M a r k e t s h a r e : 9 5 % fixed -p o i n t , 5% floa t in g -poin t
n E a ch p r oce s s or fa m ily h a s d ozen s of m e m b e r s
w it h d iffe r e n t on -ch ip con fig u r a t ion s
4 S ize a n d m a p of d a t a a n d p r ogr a m m e m or y
4 A/D, in p u t /ou t p u t b u ffe r s , in t e r fa ces , t im e r s , a n d D /A
n D r a w b a ck s t o con v e n t ion a l D S P p r oce s s or s
4 N o b y t e a d d r e s s in g (n e e d e d for i m a g e a n d v i d e o )
4 L im it e d on -ch i p m e m o r y
4 L im it e d a d d r e s s a b l e m e m o r y o n f i x e d - p o i n t D S P s ,
e x c e p t M ot o r o l a 5 6 3 0 0 ( 1 6 M w d a t a ; 6 4 M w p r ogr a m )
4 N on -s t a n d a r d C e x t e n s ion s t o s u p p or t fixed -poin t d a t a
7
Pipelining
Sequential (Motorola 56000)
Fetch Decode Read Execute
Pipelined (Most conventional DSP processors)
Fetch Decode Read Execute
Superscalar (Pentium, MIPS)
Managing Pipelines
compiler or programmer
pipeline interlocking
Fetch Decode Read Execute
in the processor
Superpipelined (CDC7600)
hardware instruction
scheduling
Fetch Decode Read Execute
8
Pipelining: Operation
Fetch Decode Read
n T im e -s t a t ion a r y p i p e l i n e m o d e l Execute
4 P r ogr a m m e r con t r ols e a ch cycle F D R E
4 M ot or o l a D S P 5 6 0 0 1 D C B A
E D C B
MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0
F E D C
G F E D
n D a t a -s t a t ion a r y p i p e l i n e m o d e l
H G F E
4 P r ogr a m m e r s p e cifies d a t a I H G F
operations
J I H G
4 T M S 3 2 0 C 3 0 /40 K J I H
L K J I
MPYF *++AR0(1),*++AR1(IR0),R0
L - K J
L - K
n I n t e r lock e d p i p e l i n e L -
4 P r ogr a m m e r i s p r ot e ct e d fr o m L
p i p e l i n e e ffe c t s
9
Pipelining: Hazards
Fetch Decode Read
n A con t r ol h a z a r d occu r s w h e n a Execute
b r a n ch in s t r u ct ion i s d e cod e d F D R E
4 F lu s h t h e p i p e l i n e D C B A
4 or : D e l a y e d b r a n c h ( e x p o s e p i p e l i n e ) E D C B
F E D C
n A d a t a h a z a r d occu r s b e ca u s e
br F E D
a n o p e r a n d ca n n ot b e r e a d y e t G br F E
4 I n t e n d e d b y p r ogr a m m e r - - br F
4 or : I n t e r lock h a r d w a r e in s e r t s - - - br
b u b b l e X - - -
TMS320C5x example
Y X - -
Y - X -
LAC #064h LAR AR2, DATA
Z Y - X
SAMM AR2 LACC *-
Z Y -
NOP
LACC *- Z Y
Z
10
Pipelining: Avoiding Control Hazards
Fetch Decode Read
Execute
A key factor in the numeric performance
of DSPs is the provision of special F D R E
hardware to perform looping.
D C
B
RPT COUNT E D AB
TBLR *+ CD
F E CD
E
rpt F E
F
n A r e p e a t in s t r u ct ion r e p e a t s X rpt F
rpt
X - rpt
on e in s t r u ct ion or a block of -
X - -
in s t r u ct ion s a ft e r r e p e a t -
X X -
X
n T h e p i p e l i n e is filled w it h X X X
X
X X X
r e p e a t e d in s t r u ct ion (or X
X X X
X
b lock of in s t r u ct ion s ) X X
n C o s t : on e p i p e l i n e flu s h on ly
11
RISC vs. DSP: Instruction Encoding
n R I S C : S u p e r s ca la r
Reorder
Load/store
FP Unit Integer Unit
n DSP: Horizontal microcode
Load/store
Load/store
Address
ALU Multiplier
12
RISC vs. DSP: Memory Hierarchy
n RISC
Registers
I/D Physical
Cache memory
Out
of TLB
order
TLB: Translation Lookaside Buffer
I Cache Internal
n DSP memories
Registers
External
memories
DMA Controller DMA: Direct Memory Access
13
TI TMS320C6x VLIW DSP Architecture
Simplified
Architecture
Program RAM
Data RAM
or Cache
Addr
Internal Buses
DMA
Data
.D1 .D2 Serial Port
Regs (A0-A15)
Regs (B0-B15)
External .M1 .M2 Host Port
Memory
-Sync .L1 .L2
Boot Load
-Async
.S1 .S2
Timers
Control Regs
Pwr Down
CPU
14
TI TMS320C6x VLIW DSP Architecture
n T w o p a r a lle l d a t a p a t h s w it h s ingle-cycle u n it s :
4 D a t a u n it - 3 2 -bit a d d r e s s ca lcu la t ion s (m o d u lo, lin e a r )
4 M u lt i p l i e r u n it - 1 6 b i t x 1 6 b i t w i t h 3 2 -bit r e s u lt
4 Logica l u n it - 4 0 -bit (s a t u r a t ion ) a r it h m e t ic & com p a r e s
4 S h ift e r u n it - 3 2 - b i t i n t e g e r A L U a n d 4 0 -bit s h ift e r
n 1 6 3 2 -bit r e g i s t e r s i n e a ch d a t a p a t h
4 4 0 b it s ca n b e s t or e d in a d ja cen t e v e n /od d r e g i s t e r s
n F i x e d -p o i n t ( C 6 2 x ) a n d floa t in g -p o i n t ( C 6 7 x )
n T M S 3 2 0 C 6 2 0 1 : $ 2 5 in v olu m e
4 1 5 0 M H z , 3 0 0 m illion M A C s /s e c, 1 2 0 0 R I S C M I P S
4 O n -ch i p m e m or y : 1 6 k x 3 2 p r o g r a m , 3 2 k x 1 6 d a t a
15
TI TMS320C6x VLIW DSP Architecture
n O n e in s t r u ct ion cycle e v e r y clock cycle
n D e e p p ipelin e
4 7 - 1 1 s t a g e s i n C 6 2 x : fe t ch 4 , d e c o d e 2 , e x e c u t e 1 -5
4 7 - 1 6 s t a g e s i n C 6 7 x : fe t ch 4 , d e c o d e 2 , e x e c u t e 1 - 1 0
4 I f a b r a n ch i s i n t h e p i p e l i n e , in t e r r u p t s a r e d i s a b led
(t h e la t e n cy of a b r a n ch is 5 cycles)
4 Avoid b r a n ch e s b y u s in g con d it ion a l e x e c u t ion
n N o h a r d w a r e p r ot e ct ion a g a in s t p i p e l i n e h a z a r d s
4 C o m p iler a n d a s s e m b l e r m u s t p r e v e n t p i p e l i n e h a z a r d s
n C 6 7 x com p u t e s floa t in g -p o i n t m u lt iply in 4 cycle s
16
C5x and C6x Addressing Modes
n Immediate TMS320C5x TMS320C6x
4 The operand is part of the
instruction ADD #0FFh add .L1 -13,A1,A6
n Register
4 The operand is specified in a
(implied) add .L1 A7,A6,A7
register
n Direct
4 The address of the operand is
part of the instruction (added ADD 010h not supported
to imply memory page)
n Indirect
4 The address of the operand is ADD * ldw .L1 *A5++[8],A1
stored in a register
17
TMS320C6x vs. Pentium MMX
P r ocessor Peak BDTI ISR Power Unit Area Volum e
MIP S m a r k s latency Price
P e n t iu m 466 49 1 . 1 4 s 4.25 W $ 2 1 3 5 .5 x 2.5 8 .789 in 3
M M X 233
P e n t iu m 532 56 1 . 0 0 s 4.85 W $ 3 4 8 5 .5 x 2.5 8 .789 in 3
M M X 266
C62x 1200 74 0 . 1 2 s 1.45 W $ 2 5 1 .3 x 1.3 0 .118 in 3
150 MH z
C62x 1600 99 0 . 0 9 s 1.94 W $ 9 6 1 .3 x 1.3 0 .118 in 3
200 MH z
BDTImarks: Berkeley Design Technology Inc. DSP benchmark
results (larger means better) http://www.bdti.com/bdtimark/results.htm
http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html
18
Application: FIR Filter
z-1 z-1 z-1
n E a ch t a p r e q u ir e s
4 F e t ch in g o n e d a t a s a m p l e
4 F e t ch in g o n e o p e r a n d
4 M u lt i p l y i n g t w o n u m b e r s
4 Accu m u la t in g m u lt iplica t ion r e s u lt
4 S h ift in g o n e s a m p l e i n t h e d e l a y l i n e
n Computing an FIR tap in one instruction cycle
4 Three data memory accesses
4 Auto-increment or decrement addressing modes
4 Modulo addressing to implement delay line as circular buffer
19
Application: FIR Filter on a TMS320C5x
Coefficients
Data
COEFFP .set 02000h ; Program mem address
X .set 037Fh ; Newest data sample
LASTAP .set 037FH ; Oldest data sample
LAR AR3, #LASTAP ; Point to oldest sample
RPT #127
MACD COEFFP, *- ; Do the thing
APAC
SACH Y,1 ; Store result -- note shift
20
Application: FIR Filter on a TMS320C62x
Coefficients
Data
Single-Cycle Loop
...
C7: ldh .D1 *A1++, A2 ; Read coefficient
|| ldh .D2 *B1++, B2 ; Read data
|| [B0] sub .L2 B0, 1, B0 ; Decrement counter
|| [B0] B .S2 c7 ; Branch if not zero
|| mpy .M1x A2, B2, A3 ; Form product
|| add .L1 A4, A3, A4 ; Accumulate result
...
21
Ordered Dithering on a TMS320C62x
periodic
1/8 5/8 7/8 3/8
array of
thresholds
7/8 3/8 1/8 5/8
Throughput of two cycles
; remove next two lines if thresholds in linear array
MVK .S1 0x0001,AMR ; modulo block size 2^2
MVKH .S1 0x4000,AMR ; modulo addr reg B6
; initialize A6 and B6
.trip 100 ; minimum loop count
dith: LDB .D1 *A6++,A4 ; read pixel
|| LDB .D2 *B6++,B4 ; read threshold
|| CMPGTU .L1x A4,B4,A1 ; threshold pixel
|| ZERO .S1 A5 ; 0 if <= threshold
[A1] MVK .S1 255,A5 ; 255 if > threshold
|| STB .D1 A5,*A6++ ; store result
||[B0] SUB .L2 B0,1,B0 ; decrement counter
||[B0] B .S2 dith ; branch if not zero
22
DSP Cores
n ASIC with:
4 Programmable DSP
4 RAM
4 ROM
4 Standard cells
4 Codec
4 Peripherals
4 Gate array
4 Microcontroller
23
DSP on General Purpose Processors
n M u lt im e d ia a p p lica t ion s on P C s
4 Vid e o, a u d io, gr a p h ics a n d a n i m a t i o n
4 R e p e t it i v e p a r a l l e l s e q u e n ces of in s t r u ct ion s
n N a t ive sign a l processin g e x a m p les
4 S u n Vis u a l I n s t r u ct ion S e t (U lt r a S P A R C 1 /2)
4Intel MMX (Pentium I/II/III)
4 I n t e l C o n cu r r e n t S I M D -F P (P e n t iu m I I I )
n S in g l e I n s t r u ct ion M u lt i p l e D a t a (S I M D )
4 O n e in s t r u ct ion a ct s on m u lt i p l e d a t a i n p a r a l l e l
4 W e ll-s u it e d for g r a p h ics
24
DSP on General Purpose Processors (cont)
n P r ogr a m m in g i s c o n s i d e r a b l y t o u g h e r
4 C / C + + com p iler s d o n ot g e n e r a t e n a t i v e s i g n a l p r o c e s s i n g
cod e e x cep t M e t r o w e r k s C o d e W a r r ior 5 g i v e s M M X cod e
4 L i b r a r ies of r ou t in e s u s in g n a t i v e s i g n a l p r o c e s s i n g
4 H a n d cod e u s in g i n - l i n e a s s e m b ly for b e s t p e r for m a n ce
4 P a ck /u n p a ck d a t a n ot a lign e d on S I M D w or d b ou n d a r i e s
4 5 0 -cycle p e n a lt y t o s w it ch t o M M X ; 0 p e n a l t y f o r V I S
4 S a t u r a t ion a r it h m e t ic in M M X; n ot s u p p or t e d in V I S
4 E x t e n d e d -p r e cision a ccu m u la t ion in M M X; n on e in V I S
n S p e e d u p for a p p lica t ion s
4 S ign a l a n d im a g e p r oce s s in g - 1 . 5 : 1 t o 2 : 1
4 G r a p h ics - 4:1 t o 6:1 (n o p a c k i n g /u n p a ck in g )
25
Intel MMX Instruction Set
n 6 4 -bit S I M D r e g i s t e r ( 4 d a t a t y p e s )
4 6 4 -b it q u a d w or d
4 P a ck e d b y t e ( 8 b y t e s p a ck e d in t o 6 4 b i t s )
4 P a ck e d w or d ( 4 1 6 -b it w or d s p a ck e d in t o 6 4 b i t s )
4 P a ck e d d ou b l e w o r d (2 d ou b l e w or d s p a ck e d in t o 6 4 b i t s )
n 5 7 n e w in s t r u ct ion s
4 P a ck a n d u n p a ck
4 A d d , s u b t r a ct , m u lt iply, a n d m u lt i p l y /a ccu m u la t e
n S a t u r a t ion a n d w r a p a r ou n d a r it h m e t ic
n M a xim u m p a r a lle l i s m p o s s i b l e
4 8 :1 for 8 -bit a d d it ion s
4 4 :1 for 8 x 1 6 m u lt iplica t ion or 1 6 -bit a d d it ion s
26
Concluding Remarks
n C o n v e n t ion a l digit a l sign a l p r o c e s s o r s
4 H igh p e r for m a n ce v s . p o w e r c o n s u m p t i o n / c o s t / v o l u m e
4 E x cel a t on e -d im e n s i o n a l p r o c e s s i n g
4 P e r c y c l e : 1 1 6 x 1 6 M A C & 4 1 6 -bit R I S C in s t r u ct ion s
n TMS320C6x VLIW DSP
4 H igh p e r for m a n ce v s . cos t /volu m e
4 E x cel a t m u lt i d i m e n s ion a l s i g n a l p r o c e s s i n g
4 P e r c y c l e : 2 1 6 x 1 6 M A C s & 4 3 2 -bit R I S C in s t r u ct ion s
n N a t i v e S ign a l P r oce s s in g
4 A v a ila b le on d e s k t op com p u t e r s
4 E x c e l s a t g r a p h ics
4 P e r c y c l e : 2 8 x 1 6 M A C s O R 8 8 -bit R I S C in s t r u ct ion s
n I n -lin e a s s e m b ly cod e for b e s t p e r for m a n ce
27
Concluding Remarks
n D igit a l s i g n a l p r o c e s s o r m a r k e t
4 4 0 % a n n u a l gr o w t h r a t e s in c e 1 9 9 0
4 $ 3 .5 billion r e v e n u e i n 1 9 9 8
4 4 5 % T I , 2 5 % L u cen t , 1 0 % M ot o r o l a , 8 % A n a l o g D e v i c e s
n I n d e p e n d e n t b e n ch m a r k in g b y i n d u s t r y
4 B e r k e l e y D e s i g n T e ch n ology In c. h t t p ://w w w .b d t i.com
4 E D N E m b e d d e d M icr o p r o c e s s o r B e n c h m a r k
C o n s or t iu m h t t p ://w w w .e e m bc.or g
n W e b r e s ou r c e s
4 com p .d s p n e w s g r ou p : F A Q w w w .b d t i.com /fa q /d s p _fa q .h t m l
4 e m b e d d e d p r oce s s or s a n d s y s t e m s : w w w .eg3.com
4 on -lin e cou r s e s a n d D S P b oa r d s : w w w .t e ch on lin e .com
28
References
1 G . E . A l l e n , B . L . E v a n s , a n d D . C. S ch a n b a ch e r , R e a l-T i m e S o n a r
B e a m for m in g on a U n ix W or k s t a t ion , Proc. I E E E A s i l o m a r C on f. O n
S i g n a l s , S y s t em s , a n d C om p u t e r s , p p . 7 6 4 - 7 6 8 , 1 9 9 8 .
h t t p ://w w w .e c e .u t e x a s .e d u /~ b e v a n s /p a p e r s /1 9 9 8 / b e a m for m in g /
2 R . B h a r g a v a , R . R a d h a k r i s h n a n , B . L . E v a n s , a n d L . K . J oh n , E v a lu a t in g
M M X T e c h n ology U s in g D S P a n d M u lt i m e d i a A p p l i c a t i o n s , Proc. I E E E
S y m . O n M icroarch itectu r e , p p . 3 7 - 4 6 , 1 9 9 8 .
h t t p ://w w w .e c e .u t e x a s .e d u /~ r a v i b /m m x d s p /
3 W . C h e n , H . J . R e e k i e , S . B h a v e , a n d E . A . L e e , N a t i v e S i g n a l P r oce s s i n g
on t h e U lt r a S P A R C in t h e P t olem y E n v i r on m e n t , Proc. I E E E A silom a r
C o n f. O n S i g n a l s , S y s t em s , a n d C om p u t ers , 1 9 9 6 .
h t t p ://w w w .e c e .u t e x a s .e d u /~ b e v a n s /cou r s e s /e e 3 8 2 c /lect u r e s /21_n s p /v i s /
4 B . L . E v a n s , E E 3 7 9 K - 1 7 R e a l - T i m e D S P L a b or a t or y , U T A u s t i n .
h t t p ://w w w .e c e .u t e x a s .e d u /~ b e v a n s /cou r s e s /r e a lt im e /
5 B . L . E v a n s , E E 3 8 2 C E m b e d d e d S o ft w a r e S y s t e m s , U T A u s t i n .
h t t p ://w w w .e c e .u t e x a s .e d u /~ b e v a n s /cou r s e s /e e 3 8 2 c /
6 A. K u lk a r n i a n d A. D u b e , E v a lu a t ion of t h e C o d e G e n e r a t i o n D o m a in in
P t olem y , h t t p ://w w w .e c e .u t e x a s .e d u / ~ b e v a n s /t a l k s / b e n c h m a r k in g 9 7 /s l d 0 0 1 . h t m
7 P . L a p s l e y , J . B i e r , A. S h oh a m , a n d E . A. L e e , D S P P r ocessor
F u n d a m en t a l s , I E E E P r e s s , 1 9 9 7 .
29