
A Domain Specific DSP Processor

Eric Tell

Reg nr: LiTH-ISY-EX-3209

Supervisor: Mikael Olausson


Examiner: Dake Liu

Linköping 2001
Abstract

This thesis describes the design of a domain specific DSP processor.

The thesis is divided into two parts. The first part gives some theoretical back-
ground, describes the different steps of the design process (both for DSP processor
design in general and for this project) and motivates the design decisions made for
this processor.
The second part is a nearly complete design specification.
The intended use of the processor is as a platform for hardware acceleration
units. Support for this has however not yet been implemented.
Contents

I Design of a Domain Specific DSP Processor

1 Introduction
1.1 Purpose of this processor
1.2 Reading guidelines

2 DSP vs. General Purpose Processors
2.1 Architecture
2.2 The MAC Unit
2.3 Saturation Arithmetic
2.4 Special Addressing Modes
2.4.1 Modulo Addressing
2.4.2 Bit-Reversed Addressing
2.5 Hardware Looping
2.6 Different Types of DSP Processors

3 The Design Flow
3.1 Requirement Analysis
3.2 Instruction Set Design and Architecture Planning
3.3 Behavioral Model
3.4 Benchmarking
3.5 Architecture Design
3.6 RTL Implementation
3.7 Verification

4 Instruction Set Analysis
4.1 Choosing The Instruction Set
4.2 This Processor

5 Machine Code Design
5.1 Orthogonality
5.2 The Instruction Word of This Processor

6 Top Level Architecture
6.1 Mapping the Instruction Set to Hardware
6.2 The Register File
6.3 Concurrent Design of Instruction Set and Architecture

7 Instruction Set Simulator
7.1 What?
7.2 Why?
7.2.1 The Assembler
7.2.2 A Behavioral Model
7.2.3 Verification
7.2.4 Concurrent Engineering
7.3 How?
7.3.1 Features
7.3.2 Implementation

8 Benchmarking
8.1 MIPS and MACS
8.2 Application Benchmarking
8.3 Algorithm Kernel Benchmarking
8.4 Tools for Benchmarking
8.5 Benchmarks for This Processor

9 Pipeline and Control Path
9.1 The Pipeline
9.2 Jumps and Branches
9.3 Hardware Looping

10 RTL Implementation
10.1 Micro Architecture
10.2 VHDL Implementation

11 Verification
11.1 The Verification Strategy
11.2 Verification for This Project
11.2.1 Block Level Verification
11.2.2 Instruction Level Verification
11.2.3 Random Testing
11.2.4 Application Level Verification

12 Conclusions and Future Improvements
12.1 Results and Conclusions
12.2 Alternative Solutions
12.3 Future Improvements
12.3.1 Interrupts
12.3.2 I/O Ports
12.3.3 Additional Instructions
12.3.4 Hardware Accelerator and Multiprocessor Support

II Design Specification

13 Introduction
13.1 Processor Features
13.2 Outline of This Part of the Thesis

14 Data Path
14.1 Architecture Overview
14.2 Arithmetic Unit
14.3 Shift Unit
14.4 Logic Unit
14.5 MAC Unit
14.5.1 Rounding
14.5.2 Saturation Unit
14.6 Register File
14.6.1 The Accumulator Registers
14.6.2 The Control Register
14.7 Addressing
14.7.1 Addressing Modes
14.7.2 Modulo Addressing
14.7.3 Bit-Reversed Addressing
14.8 The Status Register

15 Control Path
15.1 The Pipeline
15.2 Instruction Decoder
15.3 Program Counter
15.4 Program Flow Controller
15.4.1 Subroutine Calls - The PC Stack
15.4.2 Hardware Looping
15.5 Pipeline Controller
15.6 Branch Controller

16 Instruction Set
16.1 The Instruction Word
16.2 Parallel Memory Instructions
16.2.1 Move to Memory:
16.2.2 MAC Operation and Load
16.2.3 Arithmetic, Logic, Shift or Move Operation and Load
16.3 Instruction Set Restrictions
16.3.1 Branch and Jump Instructions
16.3.2 Hardware Loops
16.3.3 Modulo Addressing
16.4 Instruction Encoding

A Instruction set summary

B Assembly code for FIR-filter

C Pipeline Timing Analysis

Part I

Design of a Domain Specific DSP Processor

Chapter 1

Introduction

DSP is an abbreviation for Digital Signal Processing. Accordingly, a DSP
processor is a processor that is designed specifically for signal processing tasks.
The purpose of the project described in this thesis was to design a 16-bit
fixed-point DSP processor.
The project includes all steps from instruction set analysis, machine code
design and architecture planning, via a C++ behavioral model, to register transfer
level implementation using VHDL.

1.1 Purpose of this processor


The processor is intended to be used as a platform for hardware accelerators. That
is, it should be possible to easily connect application specific hardware units to the
processor core. For this reason the instruction set of the processor is quite simple
and instead space for adding hardware accelerator instructions has been reserved.
Furthermore the processor is intended to be used in a system of up to four
similar processors that should be able to share memory.
Due to the limited time available for this project however, the actual hardware
for supporting hardware accelerators and multiple processors has not yet been
implemented (although it is supported by the instruction set).

1.2 Reading guidelines


The thesis is divided into two parts. The first part gives some background to DSP
processor design and describes the design flow and the design decisions made in
this project. The second part is a more or less complete design specification for
the processor.
The first part has the following contents:

Chapter 2 describes the special features of DSP processors that separate them
from general purpose processors.

Chapter 3 gives an overview of the design process.

Chapters 4 to 11 go into detail on different parts of the design.

Chapter 12 has some conclusions and proposals for future improvements.

Many of chapters 4 to 11 have a corresponding section in part two of the report,
and the reader may want to look ahead to see the actual implementation of some
part of the processor before continuing to the next chapter.

Chapter 2

DSP vs. General Purpose Processors

This chapter describes some important differences between DSP processors and
general purpose processors. It also explains some concepts that are used later in
the report.

2.1 Architecture
The most important difference between the architecture of DSP processors and
general purpose processors is probably the possibility of multiple memory
accesses in one clock cycle. Generally a DSP processor has separate program and
data memories. This allows the processor to fetch an instruction while
simultaneously fetching operands or storing results for a previous instruction.
Often it is also possible to fetch multiple data words from memory in one clock
cycle by using multiple buses and multi-port memories or multiple independent
data memories.

2.2 The MAC Unit


The single most typical feature of DSP processors is the dedicated hardware for
multiply-and-accumulate or MAC operations. The MAC operation is used for
calculating a sum of products - two operands are multiplied and the product is
added (or subtracted) to a cumulative sum. The MAC operation is very common
in DSP applications and is used for example for vector products, digital filters,
correlation and Fourier transforms.
Usually the operands in the addition/subtraction have more bits than the output
of the multiplication. The extra bits are called guard bits. The guard bits make it
possible to accumulate a number of values without the risk of overflow. If $n$ guard
bits are used, $2^n$ values can be accumulated without the possibility of overflow.
Most DSP processors have four or eight guard bits.
The MAC hardware usually also supports saturation (see 2.3 below) and rounding,
to get a result of the native data width (see footnote 1).
Example: If the native data width is $n$, then the result of the multiplication will
have $2n$ bits. So with $m$ guard bits the result of the MAC operation will have
$2n + m$ bits. This value can be saturated to a $2n$-bit value and then rounded to get
a value of the native data width $n$, which can be stored in memory or used in other
kinds of operations.
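
The example above can be sketched in C++ for the case n = 16 and m = 8, which are
the widths used for this processor. This is only an illustration: the names
(macStep, saturate32, round16) are made up for the sketch, and rounding by adding
half an LSB before truncation is an assumption, not necessarily the rounding
scheme of the actual MAC unit.

```cpp
#include <cassert>
#include <cstdint>

// Assumed widths: n = 16 (operands), 2n = 32 (product), m = 8 guard bits.
constexpr int64_t kAccMax = (int64_t{1} << 39) - 1;  // largest 40-bit value
constexpr int64_t kAccMin = -(int64_t{1} << 39);     // smallest 40-bit value

// One MAC step: acc + a * b. The (2n + m)-bit accumulator is modelled in an
// int64_t; the assert documents that the value must stay within 40 bits.
int64_t macStep(int64_t acc, int16_t a, int16_t b) {
    int64_t result = acc + int64_t{a} * int64_t{b};
    assert(result >= kAccMin && result <= kAccMax);
    return result;
}

// Saturate the 40-bit accumulator value to a 2n = 32-bit value.
int32_t saturate32(int64_t acc) {
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return static_cast<int32_t>(acc);
}

// Round the 32-bit value to the native width n = 16 by keeping the upper
// 16 bits. Assumes an arithmetic right shift for negative values
// (guaranteed since C++20, common practice before that).
int16_t round16(int32_t value) {
    int64_t rounded = (int64_t{value} + 0x8000) >> 16;
    if (rounded > INT16_MAX) rounded = INT16_MAX;  // rounding up may overflow
    return static_cast<int16_t>(rounded);
}
```

With m = 8 guard bits, 2^8 = 256 products of maximum magnitude can be passed to
macStep before the 40-bit range could be exceeded; saturate32 followed by round16
then brings the result back to the native 16-bit width.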

2.3 Saturation Arithmetic


Normally if the result of an arithmetic operation in a hardware unit is outside the
data range the result will “wrap around”. For example if one is added to the highest
possible number, the result will be the lowest possible number. Saturation on the
other hand means that if the real result is larger than what can be represented with
the available number of bits, the output will be the highest possible value and if
the result is lower than the lowest value that can be represented, the result will be
the lowest possible value. The difference is illustrated in figure 2.1.
Using saturation arithmetic reduces distortion due to overflow and may also
prevent parasitic oscillations in recursive algorithms [5].
Saturation arithmetic is almost always supported for the MAC operations
in DSP processors and sometimes also for other operations (like addition and
subtraction).
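
The difference between the two behaviors can be sketched for 16-bit addition as
follows. The function names are hypothetical and the code only models the
arithmetic, not any particular hardware unit.

```cpp
#include <cstdint>

// Normal ("wrap around") addition: the sum is reduced modulo 2^16, as in
// ordinary two's complement hardware. The final conversion to int16_t is
// modular on two's complement machines (guaranteed since C++20).
int16_t addWrap(int16_t a, int16_t b) {
    uint16_t sum = static_cast<uint16_t>(static_cast<uint16_t>(a) +
                                         static_cast<uint16_t>(b));
    return static_cast<int16_t>(sum);
}

// Saturating addition: the sum is computed with extra precision and then
// clamped to the largest or smallest representable 16-bit value.
int16_t addSaturate(int16_t a, int16_t b) {
    int32_t sum = int32_t{a} + int32_t{b};
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return static_cast<int16_t>(sum);
}
```

Adding one to the highest possible number (32767) gives -32768 with addWrap but
stays at 32767 with addSaturate.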

2.4 Special Addressing Modes


Addressing modes of DSP processors are chosen to fit the applications. The
most common memory addressing mode is register indirect addressing with post-
increment, which is used to execute repetitive operations on data stored sequen-
tially in memory. Two other special addressing methods common in DSP proces-
sors are described below.
1 The native data width is the width of data memory and most buses.

a) Normal arithmetic hardware b) Saturation arithmetic
Figure 2.1: The difference between saturation and normal arithmetic. The X-axis
is the “real” result and the Y-axis is the output from hardware

2.4.1 Modulo Addressing


Modulo addressing (or circular addressing) means that when a memory pointer
(used for example for post-increment addressing) reaches the end of a specified
memory area, it automatically starts over from the beginning. This can be used
for example for implementing circular data buffers.
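
A minimal sketch of a pointer with modulo post-increment is given below. The
structure and its field names are assumptions made for the illustration; how the
buffer area is actually specified in this processor is described in part two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an address register with modulo (circular)
// post-increment addressing.
struct ModuloPointer {
    uint16_t start;   // first address of the circular buffer
    uint16_t length;  // number of words in the buffer
    uint16_t addr;    // current address, start <= addr < start + length

    // Return the current address, then advance by `step` words. When the
    // end of the buffer is passed, the pointer starts over from `start`.
    uint16_t postIncrement(uint16_t step = 1) {
        assert(length > 0 && addr >= start && addr - start < length);
        uint16_t current = addr;
        addr = static_cast<uint16_t>(start + (addr - start + step) % length);
        return current;
    }
};
```

For a four-word buffer starting at address 100, successive calls starting from
address 102 yield 102, 103, 100, 101 and so on.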

2.4.2 Bit-Reversed Addressing


The bit-reversed addressing mode is used specifically for implementing the fast
Fourier transform (FFT) algorithm. The problem with the FFT algorithm is that
it either takes its input or leaves its output in a scrambled order, so at some point
the order of the data has to be rearranged.
The most common form of the FFT requires the data to be taken in bit-reversed
order. The term bit-reversed comes from the fact that the ordering matches the
output one would get from a binary counter if the bits were taken in reversed
order (that is, the least significant bit first). This is illustrated below:

Normal order    Bit-reversed order

000 = 0         000 = 0
001 = 1         100 = 4
010 = 2         010 = 2
011 = 3         110 = 6
100 = 4         001 = 1
101 = 5         101 = 5
110 = 6         011 = 3
111 = 7         111 = 7

Because the FFT algorithm is so common, many DSP processors have hard-
ware support for bit-reversed addressing.
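
In software, the bit-reversed index can be computed as in the following sketch
(the function name is hypothetical). It performs in a loop what the addressing
hardware provides at no extra cost.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the lowest `bits` bits of `index`, least significant bit first.
uint16_t bitReverse(uint16_t index, int bits) {
    assert(bits > 0 && bits <= 16);
    uint16_t reversed = 0;
    for (int i = 0; i < bits; ++i) {
        // Shift the result left and append the current lowest bit of index.
        reversed = static_cast<uint16_t>((reversed << 1) | (index & 1u));
        index >>= 1;
    }
    return reversed;
}
```

Calling bitReverse(i, 3) for i = 0 to 7 reproduces the right-hand column of the
table above: 0, 4, 2, 6, 1, 5, 3, 7.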

2.5 Hardware Looping


Since many DSP algorithms are based on repetitive computations most DSP pro-
cessors provide hardware support for efficient looping. Usually there is a loop
or repeat instruction, that allows loops to be implemented without spending any
extra clock cycles for testing and updating the loop counter, or for jumping back
to the start of the loop.
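
The idea can be modelled as a small piece of next-address logic that runs in
parallel with normal instruction execution. The sketch below is a generic
illustration with made-up names; it is not the loop hardware of this processor,
which is described in chapter 9.

```cpp
#include <cstdint>

// Hypothetical zero-overhead loop controller: the loop test, the counter
// update and the jump back are all part of the program counter update, so
// no instructions are spent on them.
struct LoopController {
    uint16_t loopStart;  // address of the first instruction in the loop
    uint16_t loopEnd;    // address of the last instruction in the loop
    uint16_t remaining;  // iterations left, 0 means no active loop

    // Compute the address of the instruction that follows `pc`.
    uint16_t nextPc(uint16_t pc) {
        if (remaining > 0 && pc == loopEnd) {
            --remaining;
            if (remaining > 0) return loopStart;  // start the next iteration
        }
        return static_cast<uint16_t>(pc + 1);     // normal sequential flow
    }
};
```

A loop of three instructions repeated three times then executes exactly nine
instructions, with no cycles spent on loop control.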

2.6 Different Types of DSP Processors


DSP processors can be divided into three categories: A general purpose DSP
processor usually has a large instruction set and can be used in almost any DSP
application. A domain specific DSP processor is made for a special category of
applications, for example audio processing. Finally an application specific DSP
processor is developed for one single application only.

Chapter 3

The Design Flow

This chapter gives an overview of the design flow for a DSP Processor. Figure 3.1
illustrates the flow and a short description of every step follows.

3.1 Requirement Analysis


Before the actual design starts you have to know what to design. In the require-
ment analysis step it is specified what the processor should be able to do and
demands on performance are carefully analyzed. For an application specific DSP
processor the requirements are generally determined by the system in which the
processor will be used. In this project no requirement analysis was made, since it
is not known exactly what the processor will be used for.

3.2 Instruction Set Design and Architecture Planning

In the instruction set design step, it is decided which instructions should be avail-
able in the processor.
The format of the instruction word is also decided and a plan for the top level
architecture is made. It is necessary to do this in conjunction with the instruction
set design to make sure it is possible to actually implement all instructions in
hardware.
These activities are described in chapters 4, 5 and 6.

Figure 3.1: The DSP processor design flow. The steps, in order, are: requirement
analysis, instruction set design/architecture planning, behavioral model,
benchmarking, architecture design, RTL implementation, verification and release
of the RTL implementation.

3.3 Behavioral Model


When a suggestion for an instruction set is ready, a behavioral model of the proces-
sor is written in some high level language, for instance C. The behavioral model,
or instruction set simulator, is a program that simulates the behavior of the pro-
cessor on instruction level. It is used in the benchmarking step and also allows
software engineers to test software before the actual processor exists.
Chapter 7 says more about the instruction set simulator.

3.4 Benchmarking
Benchmarking is used to verify that the instruction set offers sufficient
performance to fulfill the requirements set up during requirement analysis. If it
does not, the instruction set has to be modified. Typically performance is
increased by moving tasks from software to hardware. After a few iterations,
hopefully a working instruction set can be released. After this is done the
software engineers can start their work concurrently with the hardware development.
Chapter 8 describes benchmarking further.

3.5 Architecture Design


This step is basically just top-down design of the whole processor architecture,
ending at register-transfer level (RTL; basically registers, buses, multiplexers and
primitive arithmetic units).

3.6 RTL Implementation


Application and domain specific DSP processors are typically implemented using
a hardware description language (HDL), usually VHDL or Verilog. The HDL
implementation can be simulated with simulation tools and other tools can be
used to synthesize hardware.
Full custom design (“drawing transistors on silicon by hand”) could also be
used for timing-critical parts.

Chapter 10 says a little more about architecture design and RTL implementa-
tion.

3.7 Verification
Verification is a very large and very important part of the design process. Although
it is the final step before the implementation is released it is important to have a
good verification strategy from the beginning and to keep it in mind during every
step of the flow.
The verification can be divided into functional verification, where for example
the logical correctness of HDL code is verified, and physical verification, which
means verifying for example timing constraints. To perform physical verification
obviously the HDL code (or at least parts of it) has to be synthesized first.
In this project no synthesis and thus no physical verification has been made
yet.
If errors are found during verification, one has to go back to the RTL
implementation or architecture design to make corrections. When the verification
result is satisfactory (it is impossible to test everything) the RTL implementation
is released.
The functional verification process is described more thoroughly in chapter
11.

Chapter 4

Instruction Set Analysis

4.1 Choosing The Instruction Set


The instruction set is the interface between hardware and software. To design the
instruction set you need to know both what jobs should be run on the processor
and what parts of the jobs should be done in hardware and software respectively.
The second is however not an easy question. Implementing a certain function
in hardware, instead of as a subroutine, of course increases hardware complexity,
but saves program memory and probably increases performance for executing that
particular function. The right choice naturally varies from one application to an-
other depending on how often the function is used and what the requirements are
on performance, memory and so on.
Some points to consider for designing a DSP processor with good performance
might be [1]:

1. Make the instruction set simple.

2. Avoid having instructions of different length.

3. All normal instructions should be executed in one clock cycle.

4. Normal instructions should only use operands from registers.

5. Use a multiple bus architecture that allows multiple memory accesses in one
clock cycle.

6. Use dedicated multiply and accumulate (MAC) hardware.

7. Provide support for fast hardware looping.

8. Provide hardware support for modulo and bit-reversed addressing.

All these recommendations have been followed in this project.

4.2 This Processor


For this processor there were no particular demands on performance, nor any
special type of job that should be run. Therefore the instruction set has been kept
quite simple. However some instructions typical for DSP processors have been
included. These are:

MAC and multiply instructions.

Hardware looping instructions.

32-bit shift and add instructions.

Most of the addressing modes common in DSP processors are also supported.
Furthermore the instruction set allows memory accesses in parallel with
computational operations (execute one operation and simultaneously load operands
for the next one from memory). This can significantly improve the performance of
many algorithms (for example convolution based algorithms like FFT and FIR/IIR
filters).
See chapter 16 and appendix A for a complete description of the instruction
set.

Chapter 5

Machine Code Design

This chapter discusses how to choose the instruction encoding and explains the
choice of instruction word for this processor.

5.1 Orthogonality
One way to measure how “good” the instruction set of a processor is, is through
the concept of orthogonality. On instruction set level, orthogonality refers to the
completeness and consistency of the instruction set and to the degree to which
different addressing modes are uniformly available with different operations [2].
For example a processor that has an add function but not a subtract function, or
where the subtract function supports different addressing modes than the add
function, would be considered nonorthogonal.
On machine code level orthogonality relies on the principle of dividing the
instructions into different groups of instructions that work similarly. The machine
code can then be multiplexed between the groups, except for the multiplex control
field of the binary code, which selects the group the instruction belongs to. This
significantly simplifies the instruction decoding, since most control signals can
be decoded from only a small number of the instruction word bits.
Figure 5.1 shows an example of an orthogonal instruction word.
Another way of increasing orthogonality and simplifying decoding is to divide
the instruction word into subfields that as far as possible always have the
same function. For example the bits selecting the instruction and those selecting
operands should be separated in the instruction word, and the source/destination
registers should always be selected by the same bits.

Figure 5.1: An instruction word and its subfields. From bit n down to bit 0 the
fields are: Mux (bits selecting the instruction group: arithmetic, logic, mac,
etc.), Op (bits selecting the operation within the instruction group), Source
(bits selecting the source register) and Dest (bits selecting the destination
register). An instruction set where all instructions used this format would be
considered highly orthogonal.
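
With such a format, decoding amounts to extracting fixed bit fields. The sketch
below assumes a 32-bit word with a 3-bit group field, a 4-bit operation field and
two 5-bit register fields; these positions and widths are invented for the
illustration and are not the encoding of this processor, which is specified in
chapter 16.

```cpp
#include <cstdint>

// Decoded subfields of a hypothetical orthogonal instruction word.
struct Fields {
    uint8_t group;   // Mux: instruction group (arithmetic, logic, mac, ...)
    uint8_t op;      // Op: operation within the group
    uint8_t source;  // Source register number
    uint8_t dest;    // Destination register number
};

// Extract `width` bits starting at bit position `lsb` (width < 32).
constexpr uint32_t extract(uint32_t word, int lsb, int width) {
    return (word >> lsb) & ((uint32_t{1} << width) - 1u);
}

// Assumed layout: group in bits 31..29, op in bits 28..25,
// source in bits 9..5 and dest in bits 4..0.
Fields decode(uint32_t word) {
    return Fields{
        static_cast<uint8_t>(extract(word, 29, 3)),
        static_cast<uint8_t>(extract(word, 25, 4)),
        static_cast<uint8_t>(extract(word, 5, 5)),
        static_cast<uint8_t>(extract(word, 0, 5)),
    };
}
```

Because every field always sits at the same position, each one can be decoded
without first knowing which instruction the word contains.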

The disadvantage of a highly orthogonal instruction set is that it needs a longer
instruction word. Since not all instructions use all subfields, better orthogonality
means more redundancy in the encoding. A longer instruction word means larger
bus and memory widths, which increases the system cost. Obviously a tradeoff has
to be made. For example it is quite common to have restrictions on what registers
can be used as source/destination for different instructions. It is also common to
use control bits that partly determine the behavior of different instructions.
This processor uses control bits to control modulo and bit-reversed addressing,
to enable or disable saturation for arithmetic and shift units and to choose between
integer and fractional mode for multiplication (see 14.6.2).

5.2 The Instruction Word of This Processor


For this processor a 32-bit instruction word has been chosen. This is a quite long
instruction word considering the limited number of instructions and addressing
modes. If this had been a commercial product, a shorter instruction word length
would probably have been chosen. All available instructions could certainly
have been implemented using a 29 or 30-bit instruction word. With a little more
use of implied addressing and some restrictions on operands, most of it could even
have been done with a 24-bit instruction word.
However the 32-bit instruction word gives some advantages that would not
have been possible with a 24-bit instruction word, for example very good
orthogonality and lots of space for new instructions and future improvements. Some
instructions, particularly those using 16-bit immediate data, could hardly have
been implemented at all using a 24-bit instruction word.
Furthermore, since this processor is partly for demonstration purposes, the
simplicity provided by a highly orthogonal instruction set was even more preferable.
The instruction word is further described in section 16.1.

Chapter 6

Top Level Architecture

This chapter describes the design of the top level architecture - Computational
units, busses, memories and registers.

6.1 Mapping the Instruction Set to Hardware


The principle of the architecture planning is to map one instruction at a time
into hardware until all instructions are executable.
Example: For the add function we need some kind of register file where the
operands are stored. Then, since we want the addition to be executed in one clock
cycle, we need two operand buses from the register file to some arithmetic unit
that performs the operation. Finally we need some result bus back to the register
file. Next we look at the move to memory and load from memory instructions. Let
us assume both address and data are in the same register file as the operands for
the add function. For the move to memory instruction we need an address bus
and a data bus from the register file to the memory. For the load instruction we
should be able to use the same address bus, but we add another data bus from
memory to the register file.
The process continues like this until we are sure all instructions can be
executed. The next step is to try to find buses that can be multiplexed; in this
case, for example, the data bus from register file to memory could probably be the
same as one of the operand buses to the arithmetic unit. The resulting architecture
for this processor can be found in figure 14.1.
Note: The processor will be implemented using CMOS technology and hence
output buses from memories and different computational units cannot easily be
connected together - tri-state buffers are not used.

6.2 The Register File


One of the first things to decide in the design process is the organization of regis-
ters. The main questions are how many registers are needed and if each register
should be used only for a specific purpose or if they should be general purpose
registers.
More registers mean easier programming and possibly fewer memory accesses,
but more hardware, more instruction word bits for addressing and higher
power consumption.
Special purpose registers make it possible to use fewer bits for addressing.
General purpose registers on the other hand increase flexibility, since a data value
in a certain register can be used for any (or almost any) operation.
A quite common compromise is to have special purpose registers for addressing
and special accumulator registers for MAC operations. One reason to do it
like this is that these registers often do not have the same width as the general
purpose registers (at least the accumulator registers certainly do not).
In this processor there are 32 general purpose registers that can all be used for
arithmetic, logic and shift operations. Eight of them can also be used as address
registers. Two of the address registers support modulo addressing and one
supports bit-reversed addressing. Another eight registers are used for other
addressing purposes (like step size for post-incremental addressing or to specify
modulo addressing areas).
The processor has two 40-bit accumulator registers for MAC operations (the
MAC unit uses a 32-bit multiplier result and 8 guard bits for the accumulator).
Each of these uses three of the general purpose registers - one is used for the
lower 16 bits, one for the higher 16 bits and half of the third register is used
for the eight guard bits.
Using only half of the third register feels a bit awkward sometimes - there
will hardly be any use for the other eight bits of that register except maybe for
some sort of flag bits. It would have felt more natural to use all 16 bits as guard
bits, but that would have meant using a 48-bit accumulator which is much more
than anyone would have use for. Maybe it would have been better to extend the
“high” register with an extra eight bits (these would have been inaccessible to the
programmer but that really doesn’t matter because the result is always saturated
before it is used for anything else than MAC-unit operations anyway).
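
The mapping of one 40-bit accumulator onto three 16-bit registers can be sketched
as follows. The names are hypothetical, and placing the guard bits in the lower
half of the third register is an assumption made for the illustration; the actual
layout is given in section 14.6.1.

```cpp
#include <cstdint>

// Hypothetical view of one 40-bit accumulator stored in three 16-bit
// general purpose registers.
struct Accumulator {
    uint16_t low;    // accumulator bits 15..0
    uint16_t high;   // accumulator bits 31..16
    uint16_t guard;  // accumulator bits 39..32 in the lower byte (assumed);
                     // the upper byte is unused
};

// Assemble the signed 40-bit value from the three registers.
int64_t toValue(const Accumulator& acc) {
    uint64_t raw = (uint64_t{acc.guard & 0xFFu} << 32) |
                   (uint64_t{acc.high} << 16) |
                   uint64_t{acc.low};
    // Sign-extend from bit 39 to the full 64 bits.
    if (raw & (uint64_t{1} << 39)) raw |= ~((uint64_t{1} << 40) - 1);
    // Modular conversion on two's complement machines (guaranteed since C++20).
    return static_cast<int64_t>(raw);
}

// Split a value (assumed to fit in 40 bits) into the three registers.
Accumulator fromValue(int64_t value) {
    uint64_t raw = static_cast<uint64_t>(value);
    return Accumulator{
        static_cast<uint16_t>(raw & 0xFFFFu),
        static_cast<uint16_t>((raw >> 16) & 0xFFFFu),
        static_cast<uint16_t>((raw >> 32) & 0xFFu),
    };
}
```

The sketch makes the awkwardness discussed above visible: only eight of the
sixteen bits in the guard register carry information.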

6.3 Concurrent Design of Instruction Set and Architecture
In practice instruction set design, machine code design and architecture planning
are to a large extent done concurrently. When you decide to add an instruction
to the instruction set you also have to consider how it could be implemented in
hardware and if there is “enough space” for it in the instruction word. Otherwise
you will surely run into trouble at the later steps.
Furthermore, as the architecture and machine code “evolve”, you can often
find new instructions that can be implemented at very little extra cost (in terms
of hardware, instruction word length or loss of orthogonality).
So the development is in fact more of an iterative process - instruction set and
architecture are built up concurrently step by step.

The architecture of this processor is described in chapter 14.

Chapter 7

Instruction Set Simulator

This chapter describes the instruction set simulator (ISS), what it is, why there is
one and how it works.

7.1 What?
The Instruction set simulator is just what the name says - a program that simulates
the function of all the instructions of the processor.
The ISS simply loads a binary file generated by the assembler, transforms
it back to assembly language instructions and runs it instruction by instruction,
generating the exact same result as the actual processor would have. It also has
features for debugging, saving simulation results to file and more.

7.2 Why?
The ISS is very important in the design flow and is used to some extent in almost
every step.

7.2.1 The Assembler


Since the ISS does the inverse transformation of the assembler it can be used to
verify the function of the assembler - If the output assembly program of the ISS is
the same as the input to the assembler there is a good probability that the function
of the assembler is correct.
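
The round-trip check described above can be sketched in Python. The two-instruction toy encoding below is purely hypothetical and has nothing to do with the real instruction word of the processor; it only shows the idea of assembling, disassembling and comparing with the original source.

```python
# Hypothetical toy ISA, used only to illustrate the round-trip check.
OPCODES = {"nop": 0, "add": 1}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def assemble(line):
    """Encode 'nop' or 'add rX, rY' into a 32-bit word (toy format)."""
    parts = line.replace(",", " ").split()
    regs = [int(r[1:]) for r in parts[1:]] + [0, 0]
    return (OPCODES[parts[0]] << 24) | (regs[0] << 8) | regs[1]


def disassemble(word):
    """Inverse of assemble() for the toy format."""
    name = MNEMONICS[word >> 24]
    if name == "nop":
        return "nop"
    return f"{name} r{(word >> 8) & 0xFF}, r{word & 0xFF}"


def round_trip_ok(program):
    """True if disassembling the assembled program reproduces the source."""
    return [disassemble(assemble(line)) for line in program] == program
```

A successful round trip does not prove that the assembler is correct (both tools could share the same misunderstanding of the encoding), which is why the text only speaks of a good probability.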

7.2.2 A Behavioral Model
The ISS is used to verify the behavior of the processor, that is, to verify that it
really does exactly what it is intended to do, that it can run all the kinds
of applications it is supposed to and, last but not least, that it can do so
with sufficient performance (measured in number of useful instructions per clock
cycle or something similar; see also chapter 8). For these reasons a bit-true and
cycle-true ISS is needed; in other words, it has to both produce exactly the right
results on instruction level and keep track of exactly how many clock cycles will
be used. (It is not as simple as just one instruction per clock cycle, especially with
a more complex pipeline.)

7.2.3 Verification
Maybe the most important use of the ISS is for verification of the hardware.
Beginning at the instruction level, basically all verification of the hardware is done
by comparing the test results from the hardware with those generated by the ISS
behavioral model.

7.2.4 Concurrent Engineering


Another very important reason for having a good ISS early in the design process is
the possibilities for concurrent development of hardware and software. As soon as
the ISS is ready, software engineers can start developing application software al-
though the actual hardware does not exist. This is absolutely necessary to achieve
the short time to market that is needed today.

7.3 How?
This section describes the ISS developed for this project.

7.3.1 Features
Apart from disassembling and running the program, either the whole program or
one instruction at a time, and showing the contents of registers, the ISS has the
following features:

Modifying Registers and Memory
Contents of general purpose registers, program counter and memory can be altered
manually.

Breakpoints
Breakpoints can be entered causing execution of the program to halt at a specified
line of code.

Load/Save Memory to File


The contents of data and tap memory [1] can be loaded from or saved to file. This is
useful, for example, for importing input data or filter coefficients generated by
Matlab, for exporting execution results to other programs, or for comparing simulation
results.

Tracking Memory and Register Use


To simplify debugging the simulator keeps a record of which registers and mem-
ory positions have been loaded with values, either by the program or manually
by the user. If the program uses a register or memory position with an undefined
value a warning message is displayed.

Script files
All functions available within the simulator can be executed from a script file, that
can be run either from within the simulator or automatically at startup.

Batch Mode
The simulator has a special batch mode for use in for example shell scripts. In
batch mode the simulator automatically starts, loads a program, runs a script file
and quits. (The script would typically load input data from file, run the program
and save the output to another file.)
[1] The term tap comes from digital filtering: a filter is divided into taps, each consisting of a
MAC operation where data is multiplied with a coefficient, so a tap memory is typically a data
memory holding (filter) coefficients.

7.3.2 Implementation
The ISS was implemented using C++. The code is divided into different files
so that everything that is dependent on the processor architecture is separated
from things related only to how the simulator works. Figures 7.1 and 7.2 show
flow charts of the most important functions of the simulator, namely loading and
running a program and executing an instruction.

[Figure 7.1, flow charts. The ’load’ command gets a filename and, if the file exists,
opens it and reads and interprets it line by line. Each line with a correct instruction
code is added to the program; otherwise an error message is displayed. At end of
file the simulator is reset (pc=0, clock=0), the status is displayed and the simulator
waits for a user command.
The ’run’ command takes step, the number of instructions to execute (step=0
executes the whole program). In each iteration the instruction at address pc is
executed and the instruction at the new pc address is checked to see if a ’nop’
should be inserted (clock=clock+2 if so, otherwise clock=clock+1). A warning
message is displayed if a warning was issued. Execution stops at a breakpoint on
the pc address, when pc exceeds the program size, or when step reaches 1
(otherwise step=step-1); the status is then displayed and the simulator waits for
a user command.]

Figure 7.1: Loading and executing a program in the ISS.
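
The ’run’ part of Figure 7.1 can be sketched in a few lines of Python. This is a simplified illustration, not the C++ code of the actual ISS: it only handles straight-line code, it leaves out warnings, and `needs_nop` is a hypothetical hook standing in for the check that decides whether the pipeline controller would insert a nop.

```python
def run(program, step=0, breakpoints=frozenset(),
        needs_nop=lambda cur, nxt: False):
    """Run a list of instructions; return (pc, clock, reason for stopping).

    step=0 runs the whole program, step=n runs at most n instructions.
    """
    pc, clock = 0, 0
    while pc < len(program):
        current = program[pc]
        pc += 1  # simplification: no jumps, loops or repeats
        if pc < len(program) and needs_nop(current, program[pc]):
            clock += 2  # the instruction plus one inserted nop
        else:
            clock += 1
        if pc >= len(program):
            break
        if pc in breakpoints:
            return pc, clock, "breakpoint"
        if step == 1:
            return pc, clock, "step count reached"
        if step > 1:
            step -= 1
    return pc, clock, "end of program"
```

For example, `run(["mac", "add", "add"], needs_nop=lambda cur, nxt: cur == "mac")` counts four clock cycles for three instructions, because one nop is assumed to be inserted after the mac.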

[Figure 7.2, flow chart. Operands are fetched and a warning is issued if not all
operands are defined (a warning stops execution after completing the instruction).
The instruction is then handled according to its type:
subroutine return: pop pc stack to jmpaddr, jmpdelay=3;
subroutine call: push pc to pc stack, jmpaddr = call address, jmpdelay=3;
jump or branch: if the jump is taken, jmpdelay=3 and jmpaddr = jump address, else do nothing;
loop instruction: push to loop stack: loopstart=pc+1, loopend=operand, loopcounter=loopreg;
repeat instruction: repeatreg=operand;
other instruction: 1: make calculations, 2: save result, 3: update status flags.
Finally the program counter is updated. If jmpdelay>1, jmpdelay=jmpdelay-1. If
jmpdelay=1, the delayed jump is made (pc=jmpaddr, jmpdelay=0). Otherwise, if
repeatreg>1 the instruction is repeated (repeatreg=repeatreg-1). If not, and pc is
not equal to loopend, normal execution continues (pc=pc+1). At the end of a loop,
the loop is restarted (pc=loopstart, loopcounter=loopcounter-1) unless
loopcounter=1, in which case the loop stack is popped and pc=pc+1.]

Figure 7.2: Executing an instruction in the ISS.
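
The program counter update in Figure 7.2 can be sketched as follows. This is a simplified Python illustration with hypothetical names, not the actual ISS code. It models only delayed jumps and the loop stack (the repeat register is left out), and it assumes that the instructions in the delay slots advance the program counter in the normal way.

```python
from dataclasses import dataclass, field


@dataclass
class FlowState:
    pc: int = 0
    jmpdelay: int = 0   # 0 means that no jump is pending
    jmpaddr: int = 0
    loopstack: list = field(default_factory=list)  # entries: [start, end, counter]


def take_jump(state, target):
    """Register a taken jump; it takes effect after two more instructions."""
    state.jmpaddr = target
    state.jmpdelay = 3


def push_loop(state, loopend, count):
    """Model a loop instruction at state.pc whose body starts at pc+1."""
    state.loopstack.append([state.pc + 1, loopend, count])


def update_pc(state):
    """Advance pc after one executed instruction."""
    if state.jmpdelay == 1:
        state.pc = state.jmpaddr      # make the delayed jump
        state.jmpdelay = 0
        return
    if state.jmpdelay > 1:
        state.jmpdelay -= 1           # still inside the delay slots
    if state.loopstack and state.pc == state.loopstack[-1][1]:
        top = state.loopstack[-1]
        if top[2] > 1:
            top[2] -= 1
            state.pc = top[0]         # restart the loop body
            return
        state.loopstack.pop()         # last iteration is done
    state.pc += 1
```

A jump taken at address 10 with target 50 gives the address sequence 11, 12, 50, so the two instructions after the jump are executed before the jump takes effect.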

Chapter 8

Benchmarking

A benchmark is some absolute measure of the performance of a processor.


Benchmarks are basically used for two tasks: To compare the performance of
different processors and to verify that a processor fulfills the necessary require-
ments.
In the DSP processor design flow, requirement verification is the important
part. Benchmarking is first used after the instruction set design, to verify that
the instruction set fulfills the performance requirements that were found during
requirement analysis.
Benchmarks for comparing different processors are important for marketing
purposes, or if you want to buy a commercial DSP processor for a system, instead
of designing one of your own. However it is not easy to find a benchmark that is
both relevant for the application where the processor will be used and gives a fair
comparison between different processors.

8.1 MIPS and MACS


Traditionally it has been common to measure performance in MIPS or Million
Instructions Per Second. This is a very simple metric, but it is often misleading,
especially for DSP processors. The reason is that the actual amount of useful work
performed by an instruction, varies a lot between different processors.
Because the multiply-and-accumulate operation is so common in DSP algorithms,
the performance of DSP processors is often given in MACS (multiply-
accumulates per second). This is, however, also an unreliable measure, because
most applications use many operations other than MACs and many processors
can also perform other operations in parallel with MAC operations.

8.2 Application Benchmarking


Benchmarks using a complete application or suite of applications are more suitable
than MIPS and MACS for comparing different processor families. Furthermore, they
also make it possible to measure, for example, memory use and power consumption.
Application benchmarking values are often given as the number of MHz needed
to perform a certain task.
Example: Let us say that a processor has a benchmark of 20 MHz for real-time
speech encoding and 2 MHz for decoding. If we want to perform both tasks simultaneously
on the processor, we add the two numbers together (and add a little
more, maybe 10%, for control code) to get an estimate of the clock frequency that is
necessary.
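
The arithmetic of this example can be written out as a short Python sketch. The function name is hypothetical and the 10% margin is just the rough figure suggested in the text.

```python
def required_clock_mhz(task_benchmarks_mhz, control_overhead=0.10):
    """Sum the per-task MHz figures and add a margin for control code."""
    return sum(task_benchmarks_mhz) * (1.0 + control_overhead)


# Speech encoding (20 MHz) and decoding (2 MHz) on the same processor.
estimate = required_clock_mhz([20, 2])
```

With the numbers from the example the estimate is about 24.2 MHz.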
One problem with application benchmarking is that the applications are often
written in a higher level language, like C, and therefore the benchmark is a measure
of the compiler as well as of the processor. Many low cost DSP processors
have quite inefficient compilers and the performance critical parts of the software
are typically coded in assembly language.
But even if the applications are coded in assembly language, it is difficult
to achieve an optimal or even near-optimal implementation, so the benchmark
becomes partly a measure of the skill of the programmer. It is also very time
consuming to develop complete applications for multiple processors.

8.3 Algorithm Kernel Benchmarking


A compromise between the oversimplified MIPS and MACS benchmark and the
complicated application benchmarking is algorithm kernel benchmarking. The
idea is to benchmark the algorithms that are the building blocks of most DSP
processing systems. These are quite simple algorithms to implement and you can
usually be sure you have the optimal implementation.
To evaluate a processor for a specific application, a weighted sum of the bench-
marks from kernel algorithms used in the application is calculated.
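
As a minimal sketch of such a weighted sum, assume that each kernel benchmark is a cycle count and that each weight is the number of times the application runs that kernel per unit of work. Both the names and the numbers in the usage note are hypothetical.

```python
def weighted_benchmark(kernel_cycles, weights):
    """Weighted sum of kernel cycle counts.

    kernel_cycles maps kernel name to measured cycles per call;
    weights maps kernel name to how often the application uses it.
    Kernels that the application does not use are simply left out of weights.
    """
    return sum(weights[name] * kernel_cycles[name] for name in weights)
```

For example, with 100 cycles for an FIR kernel used ten times and 2000 cycles for an FFT used once, the estimate is 10 * 100 + 1 * 2000 = 3000 cycles.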
As an example of kernel algorithms, table 8.1 [3] lists the algorithms used
in the BDTI Benchmarks (BDTI, Berkeley Design Technology, Inc., is a company
that, among other things, publishes impartial technical evaluations of DSP
processors).

8.4 Tools for Benchmarking


Generally an instruction set simulator is used for benchmarking. For this of course
a cycle-true ISS is needed. If benchmarks for things like power consumption are
wanted other methods have to be used, for example emulator hardware.

8.5 Benchmarks for This Processor


Due to lack of time no real benchmarking has been done for this project. However,
the FIR filter mentioned in 11.2.4 is a typical kernel algorithm. The implementation
used executes a FIR filter with T taps and N samples in $N(T + 7) + 12$
clock cycles, which seems to be quite normal (some DSP processors on the market
are better, some are worse). With better I/O instructions this value would improve
further.
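
To get a feeling for this cycle count, the expression can be evaluated for concrete values, reading it as N(T + 7) + 12. The sketch below is only an illustration; the function name is hypothetical and the example values (31 taps, 1000 samples) are chosen to resemble the 30th order filter in 11.2.4.

```python
def fir_cycles(taps, samples):
    """Clock cycles for the FIR implementation: N * (T + 7) + 12."""
    return samples * (taps + 7) + 12


# A 30th order filter has 31 taps; assume 1000 input samples.
total = fir_cycles(taps=31, samples=1000)
```

This gives 38 012 cycles, that is 38 cycles per sample for 31 multiply-and-accumulate operations, so the overhead per sample is the 7 extra cycles.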

Function                Description                           Example Application

Real Block FIR          Finite impulse response filter that   Speech processing (e.g. G.728
                        operates on a block of real (not      speech encoding).
                        complex) data.

Complex Block FIR       FIR filter that operates on a block   Modem channel equalization.
                        of complex data.

Real Single-Sample FIR  FIR filter that operates on a single  Speech processing, general
                        sample of real data.                  filtering.

LMS Adaptive FIR        Least-mean-square adaptive filter;    Channel equalization, servo
                        operates on a single sample of real   control, linear predictive
                        data.                                 coding.

IIR                     Infinite impulse response filter      Audio processing, general
                        that operates on a single sample of   filtering.
                        data.

Vector Dot Product      Sum of the pointwise multiplication   Convolution, correlation, matrix
                        of two vectors.                       multiplication, multi-dimensional
                                                              signal processing.

Vector Add              Pointwise addition of two vectors,    Graphics, combining audio
                        producing a third vector.             signals or images.

Vector Maximum          Finding the value and location of     Error control coding,
                        the maximum value in a vector.        algorithms using block
                                                              floating-point.

Viterbi Decoder         Decode a block of bits that has been  Error control coding.
                        convolutionally encoded.

Control                 A sequence of control operations      Virtually all DSP applications
                        (test, branch, push, pop and bit      include some control code.
                        manipulation).

256-Point In-Place FFT  Fast Fourier Transform converts a     Radar, sonar, MPEG audio
                        time-domain signal to the frequency   compression, spectral
                        domain.                               analysis.

Bit Unpack              Unpacks variable length data from a   Audio decompression,
                        bit stream.                           protocol handling.

Table 8.1: Kernel algorithms in the BDTI benchmarks.


Chapter 9

Pipeline and Control Path

The control path [1] of a processor has three necessary parts. The first is the program
memory or control memory, where all the instructions of the program are stored.
The second is the program flow controller that generates the program counter (PC)
address, that points out the next instruction to be fetched from program memory.
Finally the instruction decoder decodes the control signals (both to control path
and data path) from the instruction word.
Usually there is also a PC stack for saving return addresses for subroutine
calls, hardware for supporting hardware looping, interrupt handling and many
other things (though many of these might be considered to be part of the program
flow controller).
This processor has a PC stack for subroutine calls, a loop stack for supporting
nested hardware loops, a repeat register for simple repeating of one instruction and
a pipeline controller whose purpose is described in 9.1 below. Interrupt handling
is not yet implemented.

9.1 The Pipeline


The execution of an instruction in a processor includes several steps. First the
instruction is fetched from program memory, then control signals are decoded
from the instruction. Next, operands may be fetched from memory or registers, an
operation could be performed by some computational unit and finally the result is
saved somewhere. The principle of pipelining is to divide this process into several
pipeline steps and execute all steps in parallel. This could mean, for example,
that in the same clock cycle as one instruction is fetched from memory, another
instruction is decoded, and yet another is executed by a computational unit. In
this way the performance of the processor is increased.

[1] A processor is divided into the data path, where all computations are made, and the control
path, which generates the control signals to the data path.

DSP processors usually use three or four pipeline steps, but other solutions
also exist. A longer pipeline allows the processor to execute faster, but program-
ming usually becomes a bit more complicated and branching effects (see 9.2) and
similar complications have greater impact.
This processor has a variable pipeline depth. Most instructions are executed
in three steps (fetch, decode and execute) but, due to the long critical path of the
multiplication unit, the execution part of the multiply and mac instructions [2] is
pipelined into two steps, giving a total of four pipeline steps for these instructions.
This might sound like a complicated solution, but as it turned out it could be
handled with little extra hardware. Conflicts can occur when a four step instruc-
tion is followed by a three step instruction that uses some of the same resources in
the third step as the four step instruction in its last step, but this is handled without
greater difficulties: The pipeline control unit monitors what kind of instruction is
currently executing, what the next instruction is and what resources these instruc-
tions use. It will then halt the pipeline for one clock cycle (by inserting a nop
instruction) when this is needed to avoid conflicts. An example of this is shown
in figure 9.1. In most cases it is possible to avoid these extra clock cycles by rear-
ranging the program code (so that a four step instruction is never followed directly
by a three step instruction that uses the same resources.)
Note: The organization of the register file allows one MAC unit instruction and
one other instruction to write to it in the same clock cycle, as long as they don’t
use exactly the same register.
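
The stall rule described in this section can be sketched in Python. This is a simplified model under assumptions, not the actual pipeline controller: each instruction is represented by a mnemonic and the set of resources it uses in its last pipeline step, and both the mnemonic "mul" and the resource names are hypothetical.

```python
FOUR_STEP = {"mul", "mac"}  # instructions whose execute part takes two steps


def needs_nop(current, following):
    """True if a nop must be inserted between two consecutive instructions.

    Each instruction is a tuple (mnemonic, set of resources used in its
    last pipeline step).
    """
    cur_op, cur_res = current
    next_op, next_res = following
    return (cur_op in FOUR_STEP
            and next_op not in FOUR_STEP
            and bool(cur_res & next_res))


def issue_cycles(program):
    """Issue cycles for a straight-line program, counting inserted nops."""
    stalls = sum(needs_nop(a, b) for a, b in zip(program, program[1:]))
    return len(program) + stalls
```

In this model a mac followed by an add that touches the same accumulator costs one extra cycle, while moving an unrelated instruction in between removes the stall, which corresponds to the rearranging of program code mentioned above.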

9.2 Jumps and Branches


Instructions that change the value of the program counter cause some problems
in a pipelined processor: when the instruction reaches the execution step of the
pipeline, the following instruction(s) are already in the pipeline. This is usually
handled in one of two ways. One solution is to flush the pipeline, that is, the instructions
in the pipeline steps preceding the execution step are “thrown away” and replaced
with nop operations. This means that every jump consumes one extra clock cycle
for every pipeline step before the execution step. The other solution is to use
delayed jumps. This means that the instructions that are already in the pipeline
are also executed. To the programmer it looks as if the jump is delayed by a
number of (typically two) instructions. This tends to make the program a bit more
difficult to follow, and the possibility of having two jump instructions immediately
following each other has to be handled somehow.
This processor uses delayed jumps (for both conditional and unconditional
jumps, subroutine calls and return from subroutine instructions). Furthermore,
an instruction that may cause a jump must always be followed by two non-jump
instructions.

a) No conflict
clock cycle:   1     2     3     4     5
fetch:         mac   add1  add2
decode:              mac   add1  add2
execute:                   mac   add1  add2
mac step 2:                      mac

b) Nop inserted
clock cycle:   1     2     3     4     5     6
fetch:         mac   add1  add2  add2
decode:              mac   add1  add1  add2
execute:                   mac   nop   add1  add2
mac step 2:                      mac

Figure 9.1: A mac instruction (four pipeline steps) is followed by two add instructions
(three pipeline steps). In a) there are no problems. In b) an operand of the
first add is part of the result from the mac instruction, so a nop is inserted by the
processor to avoid an error.

[2] From here on, mac in small letters refers to the multiply-and-accumulate instruction, while
MAC in capital letters refers to the MAC computational unit. Instructions other than mac (for
example multiplication) are also executed on the MAC unit.
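
The rule that a jump instruction must be followed by two non-jump instructions is easy to check mechanically, for example in an assembler. The following Python sketch shows one way to do it; the mnemonics are hypothetical and the program is represented simply as a list of mnemonic strings.

```python
JUMP_OPS = {"jump", "branch", "call", "ret"}  # hypothetical mnemonics


def delay_slot_violations(program):
    """Return indices of jump instructions that break the delay slot rule.

    A jump instruction must be followed by two non-jump instructions.
    """
    bad = []
    for i, op in enumerate(program):
        if op in JUMP_OPS:
            slots = program[i + 1:i + 3]
            if len(slots) < 2 or any(s in JUMP_OPS for s in slots):
                bad.append(i)
    return bad
```

A program like `["jump", "add", "call", "nop", "nop"]` is reported at index 0, because the call sits in the second delay slot of the jump.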

9.3 Hardware Looping


This processor has two instructions for hardware looping: the simple ’repeat’ instruction
that just repeats one instruction a number of times and the more complex
’loop’ instruction that repeats two or more instructions and also allows nested
loops. The reason for having two different instructions is that the pipeline makes
it difficult to handle very short loops (one or two instructions) in the same way as
longer loops. Many processors that use only one loop instruction have special restrictions
for short loops (for example they may have to be repeated at least some
minimum number of times).
The two instructions use completely different hardware. The hardware for
the ’repeat’ instruction is basically just a counter, counting down for as long as an
instruction is repeated. The ’loop’ instruction is based on a loop stack, where start
and end addresses as well as loop counter values for up to four nested loops are

stored.
See section 15.4.2 for further information on hardware looping.
The complete control path is described in chapter 15.
Section 16.3 discusses restrictions to the use of some instructions due to pipeline
complications.

Chapter 10

RTL Implementation

10.1 Micro Architecture


When the top level architecture is completed, the next step is to describe every
block on register-transfer level. This means making circuit diagrams consisting
of components like registers, multiplexers and arithmetic primitives (for example
adders). The principle for doing this is similar to that of designing the top level
architecture, in that operations are mapped onto the hardware of the design unit in
question one at a time, while trying to multiplex the hardware as far as possible.
The control signals to all multiplexers are named and a table describing which
control signals are used for every instruction is created. (This table is in fact
practically a truth table for the function of the instruction decoder.)

10.2 VHDL Implementation


The final step is to translate the whole processor into synthesizable hardware description
language code. The tools used for this were Renoir from Mentor Graphics
and the hardware description language VHDL.
Renoir can generate VHDL or Verilog code from block diagrams, truth tables,
state machines and flowcharts. It also has an interface to the simulation tool ModelSim
(from the same company), which was used for all verification of the VHDL
code, and many other features, of which a few (like version management) were
used.
Mostly the block diagram entry method was used: (hierarchical)
block diagrams are created and the blocks at the lowest level are described in
VHDL code.
Synthesizable VHDL code was generated for the whole processor core except
the memories, for which simple behavioral models were used.

Chapter 11

Verification

Verification is a major part of the hardware design work. It could be up to 80%
of the design time for a complex system. Deciding the verification strategy early
allows early development of the verification environment (test benches and so on),
which improves concurrent engineering possibilities. The verification flow has a
major influence on the whole design flow.

11.1 The Verification Strategy


A good verification strategy might be to focus on achieving a very high test cover-
age at block level and then focus on interconnections between blocks and corner
cases on the higher levels [1].
For clarity here follows some common verification related terminology:

Compliance Testing
Verifying that the design or part of the design follows its specification.

Corner Testing
Trying to find and test the most complex scenarios that are most likely to cause
errors.

Random Testing
Since it is usually impossible to find all corner cases, it can be useful to use a
setup that generates and tests random test vectors. This often generates strange
unanticipated corner cases.

Path Coverage
Path coverage is a measure of how many of all possible interconnections between
different components are tested. Normally a path coverage of 100% is required.

Branch Coverage
This is a measure of how many of all possible combinations of multiplexer inputs
are tested. Usually a branch coverage of 100% is needed at least at the lowest
block level.

11.2 Verification for This Project


Due to the limited time for this project the verification has not been as extensive
as it would have been in a “real” project.
Below follows a discussion about what verification has been done and what
would have been done if there had been more time.

11.2.1 Block Level Verification


The verification performed on block level is mainly compliance testing. However
the test vectors have been chosen to at least reach full branch coverage and full
path coverage. No corner or random testing was done at block level.

11.2.2 Instruction Level Verification


Most of the effort on instruction level verification went into corner testing of computational
instructions. However, a lot more effort could still have been put into
finding corner cases if there had been time. There were actually some bugs related
to corners that were missed here but turned up during the subsequent random
testing.

Program flow instructions were also tested rather extensively on instruction
level. The exception is the ’loop’ instruction, which has a lot of strange special
cases that might cause problems. These were not all tested to the extent they
should have been; however, most of them were tested quite thoroughly during the
block level testing of the program flow controller and PC-, loop- and repeat-stacks.

11.2.3 Random Testing


Testing was performed with random data, but not with random instructions. In
other words, an assembler program was written that loads data from memory, executes
different operations on this data and writes it back to memory. For every
execution, new random input memory data and new random values for control
flags were generated. The program tested every mode of every computational
instruction; however, program flow instructions were not tested.
In the last session, the random testing was run approximately 220 000 times
without finding any errors. That is every computational instruction was run with
approximately 220 000 different combinations of input data and control bit set-
tings. Although this is only a small fraction of all possible input data, the result
implies that the possibilities of finding additional errors within reasonable time
are quite small.
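
The principle of random data testing can be sketched in Python: the same random operands are fed to the design under test and to a reference model, and the results are compared. This is only an illustration with hypothetical names. The saturating 16-bit addition used as the reference is an assumed example operation, and in the real flow the reference results come from the ISS and the tested results from the VHDL simulation.

```python
import random


def reference_add16(a, b):
    """Reference model: saturating signed 16-bit addition (assumed example)."""
    return max(-32768, min(32767, a + b))


def random_test(dut, reference, runs, seed=0):
    """Compare dut against reference on random signed 16-bit operand pairs.

    Returns the list of failing (a, b) pairs; an empty list means that no
    mismatch was found.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        a = rng.randint(-32768, 32767)
        b = rng.randint(-32768, 32767)
        if dut(a, b) != reference(a, b):
            failures.append((a, b))
    return failures
```

Passing a deliberately wrong model as `dut`, for example an addition without saturation, is a quick way to confirm that the test setup itself can detect errors.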
It would have been possible to also generate random instructions (just generate
random 32-bit words and throw away all that are not valid instruction words)
and this is usually done “in reality”, but it was considered to be a bit too time
consuming for this project.

11.2.4 Application Level Verification


Application level verification means running the sort of applications the processor
is intended to run “in reality”. This is to prove that the processor really can do
what it is intended to do in practice.
The application tested on this processor was a 30th order FIR-filter. This ap-
plication tests both the repeat and loop instructions, modulo addressing and the
parallel computation and memory access possibilities. It is also an example of
the type of convolution based algorithms that are very common in digital signal
processing. The program can be found in appendix B.
The filter used was a low-pass filter. The input and output can be seen in figure
11.1.

[Figure 11.1, three panels, each plotted over samples 0 to 1000 with values between
0 and 1: Input; Output from Matlab (64-bit floating point); Output from DSP (16-bit
fixed point).]

Figure 11.1: FIR filter outputs from Matlab and from the DSP processor.

The program turned out to work very well. The difference from the result of
a Matlab implementation, using 64-bit floating point representation, was of the
same order as the precision possible with 16-bit fractional numbers ($2^{-15}$, or approximately
three units in the fifth decimal).
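
The precision of 16-bit fractional numbers can be illustrated with a small Python sketch. It assumes the common Q15 format (one sign bit and 15 fractional bits) with rounding to nearest; the exact rounding used by the processor is not stated here, and the function names are hypothetical.

```python
LSB = 2 ** -15  # weight of the least significant bit, about 3.05e-5


def to_q15(x):
    """Quantize x in [-1, 1) to a signed 16-bit fractional (Q15) integer."""
    return max(-32768, min(32767, round(x * 32768)))


def from_q15(q):
    """Convert a Q15 integer back to a floating point value."""
    return q / 32768
```

Converting a value such as 0.3 to Q15 and back changes it by at most half an LSB, which is why a difference of a few units in the fifth decimal between the fixed point and floating point results is what one would expect.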

Chapter 12

Conclusions and Future


Improvements

This chapter summarizes results and conclusions from the project and presents
ideas for changes and future improvements of the processor. Many of these things
have already been mentioned in the previous chapters.

12.1 Results and Conclusions


On the whole the processor works well. It is fairly uncomplicated to program (at
least with a somewhat more advanced assembler software than what was written
for this project) and it has quite good performance for convolution based algo-
rithms. Performance for other algorithms has not been investigated due to lack of
time and limited knowledge in the area of DSP applications. The verification was
also rather limited, but everything that has been tested works.

12.2 Alternative Solutions


This section summarizes some things that might have been implemented in other
ways.

The Accumulator Registers


As mentioned before, the way the guard bits use half of a general purpose register
feels a bit strange, and maybe it would have been better to have the guard bits in
a separate register.

Choice of Source Accumulator Register


The way it works now, instructions using an accumulator register as source (MAC,
32-bit add and 32-bit shift), must use the same accumulator register both as source
and destination. The reason for this lies only in the instruction word (in other
words it is not because of the data path architecture) and it would have been quite
easy to allow both source and destination accumulator registers to be specified in
the instruction. The price for this would have been that the source register for
both multiplication operands, would have been restricted to use only half of the
32 general purpose registers (the second operand is already restricted to use only
registers 16 to 31).
Among other things, this would have made it possible to execute operations on
an accumulator register value without losing the old value and to copy the value
of one accumulator register to the other.

Shorter Instructions
As mentioned before there is very much “space left” in the instruction word and
some bits are almost not used at all. Even with half of the instruction space saved
for accelerator instructions, the instruction word could easily have been made at
least two bits shorter. However, if standardized memories were used and the
instruction word length therefore had to be the traditional “multiple of eight”, 24
bits would be the next smaller step, and that would hardly have been achievable
without further limitations to the instruction set.

12.3 Future Improvements


Here are some examples of things that have not been implemented at all yet.

12.3.1 Interrupts
Although not implemented for this processor yet, interrupt handling is necessary
to efficiently communicate with other hardware. Basically all DSP processors
handle interrupts, however the way in which it is done is often a bit simpler (and
quicker) than for general purpose processors.

Specifically for this processor, the support for hardware accelerators would
probably include some sort of interrupt.

12.3.2 I/O Ports


The processor supports no I/O yet (except maybe memory mapped) and some sort
of port interface should be added.

12.3.3 Additional Instructions


As previously stated, this processor is intended as a platform for hardware accel-
erators. This means that “application specific instructions” should be added and
therefore the “base” instruction set is quite simple. However some more general
instructions could be added. For example many DSP processors have instructions
to support division and square root calculations - operations which are quite com-
plicated to do without hardware support. Also simpler instructions like minimum
and maximum value calculations could be added.

12.3.4 Hardware Accelerator and Multiprocessor Support


As mentioned before, this processor is intended to eventually be used in a system
together with four other similar processors that should be able to share memory.
Although the instruction set supports this, the necessary hardware is not yet im-
plemented.
The situation is similar for the hardware accelerator support.

Part II

Design Specification

Chapter 13

Introduction

This second part of the thesis describes the architecture and instruction set of the
processor. It is not a complete specification, but should at least be enough for the
user of the processor.

13.1 Processor Features


The processor uses 32-bit instructions. The instruction set is highly orthogonal,
but the number of instructions is not so large (about 60). There is a lot of “unused
space” for future additions. In particular, the processor is intended as a platform for
hardware accelerator units and there is room reserved for this in the “instruction
space”.
The processor has a 16-bit native data width and uses fixed-point number rep-
resentation.
It has a Multiply-and-accumulate unit consisting of a 32-bit multiplier, a 40-
bit accumulator (in other words 8 guard bits are used) and a 32-bit barrel shifter
for scaling and other purposes.
The processor supports some parallelism, as it is possible (under certain cir-
cumstances) to do up to two memory access operations and one computational
operation every clock cycle. Among other things, this makes it possible to exe-
cute convolution based algorithms, with one multiply-and-accumulate operation
per clock cycle.
Other features include support for zero overhead hardware looping and mod-
ulo and bit-reversed addressing.

13.2 Outline of This Part of the Thesis
This part of the thesis has the following chapters:

Chapter 14 Describes the architecture of the data path, the computational units,
registers and addressing.

Chapter 15 Gives an overview of how the control part of the processor architec-
ture works.

Chapter 16 Describes the instruction word and its subfields and lists the machine
code of all instructions. This chapter also contains information on some
restrictions that apply to the use of some instructions (mainly program
flow instructions).

Chapter 14

Data Path

14.1 Architecture Overview


The computational units of the processor are the following:
Arithmetic unit for addition, subtraction and other common arithmetic operations
Logic unit for bitwise ’and’, ’or’, ’xor’ and ’not’ operations.
Shift unit for arithmetic shift, logic shift and rotation operations.
MAC unit for multiplication and multiply-accumulate operations. The MAC can
also perform 32-bit arithmetic shift and 32-bit addition/subtraction.
The data path architecture can be seen in figure 14.1.
Operands are always taken from the 32x16 bit register file. All of the 32
registers can be used as general purpose registers for common arithmetic, shift
and logic operations, but most of them also have other functions. Particularly
six of these 32 registers are also used as 40-bit accumulator registers for MAC
operations. Some of the registers are also used as address registers or for other
address generation purposes.
Data exchange between computational units, register file and memories is
facilitated by the following busses:
Two 16-bit data busses DA and DB that provide operands for computational units
and data to memories.
Two 16-bit result busses RA and RB for sending results from computations and data
from memory to the register file.
One 40-bit accumulator register bus from the register file to the MAC unit and one
40-bit bus from the MAC unit back to the register file.

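Figure 14.1: Data path architecture. Shown are the address generator with the
16-bit address busses AA and AB, the memories TM and DM, the register file, the
arithmetic, shift, logic and MAC units, the 16-bit busses DA, DB, RA and RB and
the two 40-bit accumulator busses.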

At the most one 40-bit word and two 16-bit words can be written to the register
file in one clock cycle.
The architecture also has two address busses AA and AB so two memories
can be addressed simultaneously.

14.2 Arithmetic Unit


The arithmetic unit performs 16-bit addition, subtraction, absolute value and
average value computations in one clock cycle. The first operand comes either from
DA or is immediate data from the instruction word. The second operand (if there
is one) is always from DB. Addition and subtraction are done with or without
saturation depending on the saturation mode control bit.
The arithmetic unit can be seen in figure 14.2.
Figure 14.2: The arithmetic unit. For absolute value computation, add/sub is
decided by the sign of DA.

14.3 Shift Unit


The shift unit performs 16-bit logic and arithmetic shift operations and rotation,
with or without intermediate carry, in one clock cycle. All operations are specified
as left shift operations. Right shift is accomplished by specifying a negative num-
ber of steps. The value to be shifted is always provided on DB. The number of
steps is either given by the five least significant bits of DA or by a 5-bit immediate
data value in the instruction word.
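
The signed step count can be illustrated with a small Python model. This is only a sketch: it assumes that the 5-bit step count is a two's complement number (-16 to 15), it ignores saturation and the carry flag, and the function names are made up for the illustration.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def lsl16(x, steps5):
    """Logic shift left of a 16-bit word; a negative step count shifts right."""
    n = to_signed(steps5, 5)          # assumption: the step count is signed
    x &= 0xFFFF
    return (x << n) & 0xFFFF if n >= 0 else x >> -n


def asl16(x, steps5):
    """Arithmetic shift left; a negative step count shifts right with sign extension."""
    n = to_signed(steps5, 5)
    if n >= 0:
        return (x << n) & 0xFFFF
    return (to_signed(x, 16) >> -n) & 0xFFFF
```

With this model a step count of b11111 (that is -1) turns both operations into a right shift by one step, and only the arithmetic variant keeps the sign bit.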
Figure 14.3 shows the shift unit.

Figure 14.3: The shift unit.

14.4 Logic Unit


The logic unit performs bitwise ’and’, ’or’, ’xor’ and ’not’ operations between
16-bit words. For ’and’, ’or’ and ’xor’ operations the first operand is either on DA
or immediate data from the instruction word and the second operand is on DB.
The single operand for the ’not’ operation is always on DA.
Figure 14.4 shows the logic unit.

Figure 14.4: The logic unit.

14.5 MAC Unit


The MAC unit performs multiplication with or without rounding, and multiply-
and-accumulate operations in two clock cycles. It also performs 32-bit shift, 32-
bit addition/subtraction and round operations in one clock cycle. All operations
can be executed with or without saturation.
The multiplication uses integer or fractional number representation depending
on the fractional mode control bit (see 14.6.2).
The MAC unit consists of the following parts:

A 32-bit multiplier multiplying two 16-bit operands from DA and DB into a
32-bit result. Both operands can be taken as signed or unsigned values indepen-
dently. The result of the multiplication is stored in an internal pipeline register in
the MAC unit.
A 40-bit adder where the registered result of the multiplication, sign extended
with eight guard bits to a total of 40 bits, can be added to or subtracted from
one of the 40-bit accumulator registers. A 16- or 32-bit value from DA, or DA
concatenated with DB, can also be added or subtracted directly to an accumulator
register. The adder also facilitates rounding (see below).
A 32-bit barrel shifter which enables the value from the accumulator to be
arithmetically shifted before reaching the adder. The number of steps to shift is either
the six least significant bits of DA or 6-bit immediate data from the instruction
word (positive value for left shift and negative for right shift).
The MAC unit is shown in figure 14.5.

14.5.1 Rounding
Rounding is executed by adding 1 to the 17th bit position (i.e. bit 16) of the 40-bit
value if the 16 least significant bits are larger than h7FFF. This means that
the 24 most significant bits (16 bits plus 8 guard bits) of the result are the rounded
value and the 16 least significant bits are unaffected. (The equivalent operation
using decimal numbers would be to add one if the decimal part was greater than
or equal to 0.5 and then truncate the decimals.)
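
The rounding rule can be written out as a short Python sketch. The function name and the representation of the accumulator as a plain 40-bit integer are assumptions made for the illustration.

```python
MASK40 = (1 << 40) - 1


def round_acc(acc):
    """Round a 40-bit accumulator value: add 1 at bit 16 if the low word exceeds h7FFF."""
    acc &= MASK40
    if (acc & 0xFFFF) > 0x7FFF:
        acc = (acc + (1 << 16)) & MASK40
    # The 16 least significant bits are left as they are.
    return acc
```

After the call, the upper 24 bits (acc >> 16) hold the rounded value.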

14.5.2 Saturation Unit


All MAC unit operations can be performed with or without saturation. If saturation
is enabled, the result will be saturated to the smallest or largest possible 32-bit
values (hFF80000000 and h007FFFFFFF respectively) whenever the result
from the adder is smaller or larger than these values.
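
A minimal Python sketch of this saturation step, assuming the adder result is available as a 40-bit two's complement value (the helper names are hypothetical):

```python
MASK40 = (1 << 40) - 1
MAX32 = 0x7FFFFFFF       # largest 32-bit value, h007FFFFFFF as a 40-bit word
MIN32 = -0x80000000      # smallest 32-bit value, hFF80000000 as a 40-bit word


def to_signed40(value):
    """Interpret a 40-bit word as a two's complement number."""
    value &= MASK40
    return value - (1 << 40) if value & (1 << 39) else value


def saturate40(acc):
    """Clamp a 40-bit adder result to the 32-bit range, keeping the 40-bit format."""
    clamped = max(MIN32, min(MAX32, to_signed40(acc)))
    return clamped & MASK40
```

The eight guard bits are what make this possible: an intermediate sum may exceed the 32-bit range without information being lost before the clamp is applied.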

14.6 Register File


As mentioned before, the processor has a register file consisting of 32 16-bit registers.
All registers are listed in table 14.1. All registers can be used as general
purpose registers for holding operands and results for computational operations.

Figure 14.5: The MAC unit. The shift count is taken from DA[5:0] or from
instr[10:5]. The 40-bit accumulator value consists of 16 low bits, 16 high bits
and 8 guard bits.

GRP0/ARP0
GRP1/ARP1
GRP2/ARP2
GRP3/ARP3
GRP4/ARP4
GRP5/ARP5
GRP6/ARP6
GRP7/ARP7/LOOP
GRP8/STEP0
GRP9/STEP1
GRP10/STEP2
GRP11/STEP3
GRP12/STEP4/TOP0
GRP13/STEP5/BOTTOM0
GRP14/STEP6/TOP1
GRP15/STEP7/BOTTOM1
GRP16/CONTROL
GRP17
GRP18
GRP19
GRP20
GRP21
GRP22
GRP23
GRP24
GRP25
GRP26/ACC0-low
GRP27/ACC0-high
GRP28/ACC0-guard
GRP29/ACC1-low
GRP30/ACC1-high
GRP31/ACC1-guard

Table 14.1: The register file.

GRP0 - GRP7 (ARP0 - ARP7) can also be used as address registers for addressing
data and tap memory.
GRP7 (LOOP) is also used to hold the loop counter value during hardware
loops.
GRP8 - GRP15 (STEP0 - STEP7) hold step lengths for updating the address
registers during post increment addressing.
GRP12 - GRP15 also hold the top (TOP0/TOP1) and bottom (BOTTOM0/BOTTOM1)
registers for modulo addressing.
GRP16 (CONTROL) holds the control bits and GRP26 - GRP31 (ACC0/ACC1)
hold the accumulator registers.
The architecture of the register file can be seen in figure 14.6.

14.6.1 The Accumulator Registers


The register file has two 40-bit accumulator registers, ACC0 and ACC1, for storing
results of MAC unit operations. ACC0 consists of GRP26 (holding the 16
least significant bits of the 40-bit accumulator register), GRP27 (bits 16 to 31 of
the accumulator register) and the 8 least significant bits of GRP28 (guard bits of
the accumulator register). In the same way ACC1 consists of GRP29 (low bits),
GRP30 (high bits) and GRP31 (guard bits).
The data on the 40-bit bus to the MAC is always either ACC0 or ACC1 and
the data on the 40-bit bus from the MAC can only be written to either ACC0 or
ACC1.
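
The mapping between a 40-bit accumulator value and its three general purpose registers can be sketched in Python as follows. The function names are hypothetical, and the sketch only models the bit layout described above, not the hardware.

```python
def split_acc(acc40):
    """Split a 40-bit accumulator value into (low, high, guard) register contents."""
    acc40 &= (1 << 40) - 1
    low = acc40 & 0xFFFF              # GRP26 for ACC0, GRP29 for ACC1
    high = (acc40 >> 16) & 0xFFFF     # GRP27 for ACC0, GRP30 for ACC1
    guard = (acc40 >> 32) & 0xFF      # low 8 bits of GRP28 / GRP31
    return low, high, guard


def join_acc(low, high, guard):
    """Rebuild the 40-bit accumulator value from the three registers."""
    return ((guard & 0xFF) << 32) | ((high & 0xFFFF) << 16) | (low & 0xFFFF)
```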

14.6.2 The Control Register


The control register holds the following control bits (the five least significant bits
of the register):
BR M0 M1 S F

The BR (Bit Reverse) Control Bit


When the BR control bit is set the address from ARP0 is bit-reversed. See 14.7.3
below for more information on bit-reversed addressing.

Figure 14.6: The register file. inc0, inc1 etc. are the new values for post increment
addressing, that is incx = ARPx + STEPx. For inc0 and inc1 modulo updating
is also applied if modulo addressing is enabled, see figure 14.7. The block marked
modulo is used for offset modulo addressing. It is similar to the blocks calculating
inc0 and inc1. The block marked BR is for bit reversal.

Figure 14.7: Modulo updating of address registers. A circuit like this generates the
signals inc0 and inc1 in figure 14.6. The multiplexers controlling the calculation
of the delta value are controlled by the sign of the step value. If STEP is positive
then delta = ARP + STEP - BOTTOM - 1 and if STEP is negative
delta = TOP - ARP - STEP - 1. If delta < 0 then ARP + STEP is still in the modulo
addressing area and inc = ARP + STEP. Otherwise inc = TOP + delta for
STEP > 0 and inc = BOTTOM - delta for STEP < 0. If modulo addressing is not
enabled inc is always ARP + STEP.

The M0 and M1 (Modulo Addressing) Control Bits
M0 and M1 enable modulo addressing for the address registers ARP0 and ARP1
respectively. See 14.7.2 below for more information on modulo addressing.

The S (Saturation Mode) Control Bit


This control bit enables saturation mode for the arithmetic and shift units. Note
that saturation in the MAC unit is not affected by this control bit.

The F (Fractional Mode) Control Bit


When this control bit is set, data words represent fractional numbers in the
range [-1, 1) instead of integers. Only multiplication operations are affected by
this.

14.7 Addressing
The processor should be able to address up to 64 kWord program memory (PM), 64
kWord tap memory (TM), 4x64 kWord data memory (DM0 - DM3, each belonging
to a different processor) and 64 kWord third memory (3M). However, at this point
no 3M and only one DM have been implemented. When everything has been
implemented all references to TM in this thesis should be replaced by “TM or
3M” and almost all references to DM should be replaced by “DM0 - DM3”.

14.7.1 Addressing Modes


Here follows descriptions of all addressing modes.

Register Direct
The data is taken from a register, GRP0-GRP31, pointed out by the instruction.
Example: add GRP5 GRP6

Register Indirect With Post Increment


The data is taken from a memory address pointed out by an address register,
ARP0-ARP7, given in the instruction. The address register is then updated by
adding the value in the corresponding STEP register.
Example: loaddm ARP1++ GRP5

Immediate Address
The data is taken from a memory address given in the instruction.
Example: loaddmi #hFF03 GRP4

Immediate Data
The data is taken from the instruction.
Example: addi #34 GRP5

Register Indirect With Offset


The data is taken from an address given by an address register plus an offset from
the instruction.

Example: loaddm ARP0 #4 GRP3

Note: Register indirect addressing without post increment can be achieved either
by using offset addressing with zero offset or by using Register indirect addressing
with post increment with the step length set to zero.

14.7.2 Modulo Addressing


Address registers ARP0 and ARP1 support modulo addressing when the flags
M0/M1 are set in the control register. The top and bottom registers TOP0/TOP1
and BOTTOM0/BOTTOM1 are implied in the operation.
Modulo addressing mode affects both post increment and offset addressing.
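
The post increment update with modulo addressing can be modelled with the following Python sketch. It follows the delta computation given in the caption of figure 14.7, with the signs reconstructed under the assumption that TOP is the lowest and BOTTOM the highest address of the modulo area. The function name is hypothetical and STEP is passed as a signed Python integer.

```python
def modulo_inc(arp, step, top, bottom, enabled=True):
    """New address register value after a post increment by `step`."""
    if not enabled:
        return (arp + step) & 0xFFFF
    if step >= 0:
        # delta counts how far ARP + STEP lies past BOTTOM
        delta = arp + step - bottom - 1
        return arp + step if delta < 0 else top + delta
    # negative step: delta counts how far ARP + STEP lies before TOP
    delta = top - arp - step - 1
    return arp + step if delta < 0 else bottom - delta


# Walk an 8-word circular buffer at addresses 100..107 with step 3.
addr, visited = 100, []
for _ in range(8):
    visited.append(addr)
    addr = modulo_inc(addr, 3, top=100, bottom=107)
```

Since only one subtraction of the buffer length is modelled, the sketch shares the restriction from section 16.3.3: the step must not be larger than the modulo area.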

14.7.3 Bit-Reversed Addressing


If the flag BR is set the address from address register ARP0 is bit-reversed (regardless
of which of the addressing methods in 14.7.1 is used), that is, the least
significant bit of the register becomes the most significant bit of the address etc.
Example: For executing a 32 point FFT, first ARP0 could be loaded with
b0000000000000000 and STEP0 with b0000100000000000. Then when memory is
accessed using post increment addressing the sequence of addresses used would
be:
b0000000000000000 = 0
b0000000000010000 = 16
b0000000000001000 = 8
b0000000000011000 = 24
b0000000000000100 = 4
:
b0000000000011111 = 31
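
The address sequence of the example can be reproduced with a few lines of Python. The sketch assumes that all 16 bits of ARP0 are reversed and that the register wraps around at 16 bits; the function names are made up for the illustration.

```python
def bit_reverse16(value):
    """Reverse the bit order of a 16-bit word."""
    result = 0
    for i in range(16):
        if value & (1 << i):
            result |= 1 << (15 - i)
    return result


def bit_reversed_sequence(start, step, count):
    """Addresses produced by post increment addressing with BR set."""
    arp0 = start
    addresses = []
    for _ in range(count):
        addresses.append(bit_reverse16(arp0))
        arp0 = (arp0 + step) & 0xFFFF   # ordinary post increment of ARP0
    return addresses


seq = bit_reversed_sequence(0b0000000000000000, 0b0000100000000000, 32)
```

The register itself is incremented normally; only the address sent to memory is reversed, which is why the step is placed in the upper bits of STEP0.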

14.8 The Status Register


The status register has the following status flags:
N Z C O
N - Negative value
Z - Zero value
C - Carry/borrow bit
O - Overflow has occurred

The status flags are used to generate conditions for branch instructions. The
following branch conditions are available:

Condition Flags
Greater than N=0 and Z=0
Greater or equal N=0
Equal Z=1
Less or equal N=1 or Z=1
Less than N=1
Not equal Z=0
Carry C=1
Not carry C=0
Overflow O=1
Not overflow O=0

See appendix A for information on which instructions affect which flags.


Note however especially that the borrow bit is inverted, that is C is set when un-
signed addition generates carry and when unsigned subtraction does not generate
borrow.
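
The branch condition table and the inverted borrow convention can be checked with a small Python model. The flag computation is a sketch of a 16-bit subtraction a - b; which instruction produces the flags, and the operand order, are assumptions, and all names are hypothetical.

```python
def flags_after_sub(a, b):
    """Flags N, Z, C, O after the 16-bit subtraction a - b."""
    a &= 0xFFFF
    b &= 0xFFFF
    diff = (a - b) & 0xFFFF
    sign_a, sign_b, sign_d = a >> 15, b >> 15, diff >> 15
    return {
        "N": sign_d,
        "Z": int(diff == 0),
        "C": int(a >= b),    # inverted borrow: set when no borrow is generated
        "O": int(sign_a != sign_b and sign_d != sign_a),
    }


# Branch conditions exactly as listed in the table above.
CONDITIONS = {
    "greater than": lambda f: f["N"] == 0 and f["Z"] == 0,
    "greater or equal": lambda f: f["N"] == 0,
    "equal": lambda f: f["Z"] == 1,
    "less or equal": lambda f: f["N"] == 1 or f["Z"] == 1,
    "less than": lambda f: f["N"] == 1,
    "not equal": lambda f: f["Z"] == 0,
    "carry": lambda f: f["C"] == 1,
    "not carry": lambda f: f["C"] == 0,
    "overflow": lambda f: f["O"] == 1,
    "not overflow": lambda f: f["O"] == 0,
}


def branch_taken(condition, flags):
    """True if the named branch condition holds for the given flags."""
    return CONDITIONS[condition](flags)
```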

Chapter 15

Control Path

The control path of the processor consists of the following blocks:


The Program Memory (PM) contains the program.
The Instruction Decoder (ID) decodes the control signals from the instruction
word.
The Program Counter (PC) generates the address for the program memory.
The Program Flow Controller (PFC) controls the program counter. It also contains
the PC stack, the loop stack and the repeat register.
The Pipeline Controller monitors the pipeline and halts the execution when a
conflict appears.
The Branch Controller keeps track of jump conditions.

The control path is shown in figure 15.1.

15.1 The Pipeline


The processor uses a variable length pipeline with three or four steps. Instructions
that incorporate a multiplication operation (’mult’, ’mac’) are executed in four
steps and all other instructions in three steps. The reason for this is that the MAC
unit is pipelined in two steps.
The first pipeline step is the instruction fetch. Here the instruction is fetched
from program memory and loaded into the instruction register.
In the second step, instruction decoding, the control signals are decoded from
the instruction word by the instruction decoder and stored in control registers.
The third step differs slightly between three and four step instructions. For
three step instructions operands are fetched, the operation is performed and the
result is written back to the register file. For four step instructions operands are
fetched, multiplication is executed and the product is written to the pipeline register
in the MAC unit.
In the fourth step of a four step instruction the value of an accumulator register
is fetched (if it is a mac instruction), the 40-bit addition/subtraction is executed and
the result is written back to the accumulator register.

Figure 15.1: The control path

The variable pipeline depth causes problems in some cases. These are handled
by the pipeline controller which is described in 15.5.
Appendix C has timing diagrams for the pipeline for some different program
flow cases.

15.2 Instruction Decoder


The instruction decoder decodes all control signals to the data path and some to
the control path. It also manages the variable pipeline depth. For this purpose the
control signals are stored in three registers:
The ctrl register stores all control signals that are only ever used in the third
pipeline step.
The ctrl2 register stores signals that can be used either in step three or step four.
The pipe4 register is used during the third step to store control signals that will be
used in the fourth step.
A special control signal in the ctrl register controls whether the signals in ctrl2
are for the third or fourth step.

15.3 Program Counter


The program counter produces the address for the program memory. During normal
execution, the PC is increased by one every clock cycle, but due to program
flow instructions and pipeline complications this can change. The next address
can be loaded either from the register file or from an immediate address in the instruction.
It can also keep its old value to perform the ’repeat’ instruction or when
the pipeline has to be halted, and it can be loaded with the loop start value from
the loop stack. Finally it can be loaded with the top value of the PC stack in case of a
subroutine return and is set to zero on reset.
The program counter is 16 bits wide, which means 64 kWords of program
memory can be addressed.

15.4 Program Flow Controller


This block controls the updating of the PC. It also monitors and controls hardware
looping (loop stack and repeat register) and subroutine calls (the PC stack).
The program flow controller is the most complicated and tricky part of the
whole processor and the design details will not be presented in this thesis. How-
ever the information below, regarding the PC stack and hardware looping, should
be enough for understanding how to program the processor.

15.4.1 Subroutine Calls - The PC Stack


When a subroutine call is made (the ’call’ instruction is executed) the program
counter is loaded with the starting address of the subroutine and the old program
counter address is pushed to the PC stack. When the program returns from the
subroutine (the ’rts’ instruction is executed) the PC stack is popped and the top
value is loaded to the program counter.
The PC stack depth is four, so up to four nested subroutine calls are possible.

15.4.2 Hardware Looping
The processor has two instructions for zero overhead hardware looping: The sim-
ple ’repeat’ instruction that repeats the following instruction a number of times
and the more complex ’loop’ instruction that repeats any number of instructions
larger than one.

Repeat
The ’repeat’ instruction is facilitated by the repeat register. During normal ex-
ecution this register holds the value one, every instruction is executed once and
the program counter is increased by one every clock cycle. However as soon as
the value of the repeat register is not one, the PC, the instruction register and the
control registers are no longer updated. Instead the repeat register is decreased by
one every clock cycle until its value is one again. The ’repeat’ instruction simply
loads a value (larger than one) into the repeat register thereby causing the next
instruction to be repeated the specified number of times.
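
The behaviour of the repeat register can be sketched with a small instruction-level model in Python. The model ignores the pipeline and only tracks which instruction is executed in which order; the program representation as (mnemonic, argument) tuples and the function name are assumptions made for the illustration.

```python
def run_with_repeat(program):
    """Return the sequence of executed mnemonics for a list of (mnemonic, argument)."""
    pc = 0
    repeat_reg = 1                 # normal execution: every instruction runs once
    trace = []
    while pc < len(program):
        op, arg = program[pc]
        if op == "repeat":
            repeat_reg = arg       # 'repeat' only loads the repeat register
            pc += 1
            continue
        trace.append(op)
        if repeat_reg != 1:
            repeat_reg -= 1        # PC is held while the register counts down
        else:
            pc += 1
    return trace


trace = run_with_repeat([("load", None), ("repeat", 3), ("mac", None), ("rnd", None)])
```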

Loop
The ’loop’ instruction works in a completely different way than the ’repeat’ in-
struction.
Before the ’loop’ instruction is executed the number of repetitions must be
loaded into the loop register (GRP7). The code section to be looped starts with
the instruction following the ’loop’ instruction and ends at an absolute address
specified in the instruction. When the ’loop’ instruction is executed, these two
program addresses and the value of the loop register are pushed to the loop stack.
When the PC reaches the address equal to the loop end address on the top
of the loop stack, the PC is set to the corresponding loop start address and the
corresponding loop counter value is decreased by one. The loop counter value is
then copied back to the loop register in the register file. In that way the current
loop counter value is accessible from the program. If the loop counter value is one
when the PC reaches the loop end, the loop stack is popped.
Since the loop stack depth is four, up to four nested loops are possible.
Note: due to pipeline complications some restrictions apply to how other
program flow instructions can be used with the ’loop’ and ’repeat’ instructions.
This information can be found in section 16.3.
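
The loop stack mechanism described above can be sketched as an instruction-level Python model. Several things here are assumptions: the instruction at the loop end address is taken to be the last instruction of the loop body, the pseudo instruction "load_loop" stands for whatever instruction writes the loop register GRP7, and pipeline effects are ignored.

```python
def run_with_loop(program):
    """Return the executed mnemonics for a list of (mnemonic, argument) tuples."""
    grp7 = 0                       # loop register
    loop_stack = []                # entries: [start address, end address, counter]
    pc = 0
    trace = []
    while pc < len(program):
        op, arg = program[pc]
        if op == "load_loop":      # hypothetical: load the repetition count into GRP7
            grp7 = arg
            pc += 1
            continue
        if op == "loop":           # arg is the absolute loop end address
            if len(loop_stack) == 4:
                raise RuntimeError("loop stack depth is four")
            loop_stack.append([pc + 1, arg, grp7])
            pc += 1
            continue
        trace.append(op)
        if loop_stack and pc == loop_stack[-1][1]:
            start, _, counter = loop_stack[-1]
            if counter == 1:
                loop_stack.pop()   # last iteration done, fall through
                pc += 1
            else:
                loop_stack[-1][2] = counter - 1
                grp7 = counter - 1 # counter is copied back to the loop register
                pc = start
        else:
            pc += 1
    return trace


single = run_with_loop([
    ("load_loop", 3), ("loop", 3), ("a", None), ("b", None), ("c", None),
])
nested = run_with_loop([
    ("load_loop", 2), ("loop", 7),
    ("load_loop", 2), ("loop", 5), ("y", None), ("z", None),
    ("w", None), ("v", None), ("end", None),
])
```

In the nested case the inner loop overwrites GRP7, which is harmless in this model because the outer counter lives on the loop stack.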

15.5 Pipeline Controller
The pipeline controller monitors the pipeline. By looking at what type of in-
struction is currently executing (if it is a three or four step instruction and which
accumulator register it uses) and which is the next instruction to be executed, it
determines if the pipeline has to be halted for one clock cycle, before the next
instruction is executed.
Halting the pipeline means in practice that the values of the PC and instruction
registers are kept and the control signals of a ’nop’ operation are loaded to the control
registers.
Halting the pipeline is necessary if a four step instruction is followed by a
three step instruction and one of the following is true:
1. The source register of the three step instruction is in the accumulator register
used by the four step instruction.
2. The three step instruction is dependent on status flags generated by the four
step instruction.
3. The three step instruction uses the MAC unit (but not the multiplier, because
then it would not be a three step instruction).
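
The three halt conditions can be expressed as a small Python predicate. This is a sketch under assumptions: the instruction description (class Instr and its fields) is invented for the illustration, and the real pipeline controller works on decoded control signals rather than on such records.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# General purpose registers that make up each accumulator (see table 14.1).
ACC_REGS = {0: {26, 27, 28}, 1: {29, 30, 31}}


@dataclass
class Instr:
    name: str
    four_step: bool = False          # uses the multiplier
    acc: Optional[int] = None        # accumulator used (0 or 1), if any
    sources: Tuple[int, ...] = ()    # GRP numbers read as source operands
    uses_flags: bool = False         # depends on flags from the previous instruction
    uses_mac_unit: bool = False      # MAC unit operation without multiplication


def must_halt(current: Instr, following: Instr) -> bool:
    """True if the pipeline has to be halted one cycle before `following`."""
    if not current.four_step or following.four_step:
        return False
    reads_acc = current.acc is not None and any(
        reg in ACC_REGS[current.acc] for reg in following.sources
    )
    return reads_acc or following.uses_flags or following.uses_mac_unit
```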

15.6 Branch Controller


The branch controller is a very small block whose only purpose is to keep track of
status flags and branch conditions and tell the PFC when to branch.

Chapter 16

Instruction Set

16.1 The Instruction Word


The processor uses an orthogonal, 32-bit instruction set. The instruction word is
divided into between four and eight subfields in one of the following ways:

1: Mux Op Mem Addr1 SReg Addr2 DReg

2: Mux Op Mem Addr1 SReg Addr2 A DReg

3: Mux Op Mem Addr1 S/DReg offset

4: Mux Op Address/Constant DReg

5: Mux Op Address/Constant A DReg

6: Mux Op Condition Prog Addr

Name              Bit    Description
Mux               31:27  Multiplexer switching between different instruction groups.
Op                26:22  Operation code choosing the actual operation.
Mem               21     1 for memory write operations, 0 for read.
                  20     1 if TM or 3M is used.
                  19     0 for TM, 1 for 3M.
                  18     1 if DM is used.
                  17:16  Selects DM memory bank 0-3.
Addr1             15:13  Address register for DM or 3M/TM addressing.
SReg              12:8   Source register.
Addr2             7:5    Address register for DM addressing when two parallel
                         memory operations are made.
DReg              4:0    Destination register.
A                 4      Accumulator register, 0 for ACC0, 1 for ACC1.
Address/Constant  20:5   Immediate address or constant value.
Offset            7:0    Immediate offset address value.
Prog Addr         15:0   Immediate program address.

Note: As long as 3M and DM1-DM3 are not implemented, bits 16, 17 and 19 will
always be zero.
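
As an illustration of the subfield layout, the following Python sketch extracts the fields of an instruction word in format 1 using the bit positions from the table above. The function and key names are made up for the example, and the sample word is built by hand to match the ’add’ pattern with no parallel memory operation.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_format1(word):
    """Split an instruction word in format 1 into its subfields."""
    return {
        "mux": bits(word, 31, 27),
        "op": bits(word, 26, 22),
        "write": bits(word, 21, 21),     # 1 for memory write, 0 for read
        "tm_or_3m": bits(word, 20, 20),
        "use_3m": bits(word, 19, 19),
        "use_dm": bits(word, 18, 18),
        "dm_bank": bits(word, 17, 16),
        "addr1": bits(word, 15, 13),
        "sreg": bits(word, 12, 8),
        "addr2": bits(word, 7, 5),
        "dreg": bits(word, 4, 0),
    }


# Mux = 00010, Op = 00001, SReg = GRP5, DReg = GRP6, no memory access.
word = (0b00010 << 27) | (0b00001 << 22) | (5 << 8) | 6
fields = decode_format1(word)
```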

16.2 Parallel Memory Instructions


Under certain circumstances the instruction set allows memory operations to be
executed in parallel with other operations.
In all cases where a memory load is executed in parallel with a computational
operation, SReg and sometimes also DReg are used both as source for the compu-
tational operation and as destination for the memory operation.
In the assembly code, parallel instructions are always separated by a “,” with
space on both(!) sides. The different possibilities are described below.

16.2.1 Move to Memory:


Two moves from register to memory, both using register indirect addressing with
post increment, can be executed in parallel. The first move must use tm and the
second one must use dm.

Example: move2tm ARPx++ GRPx , move2dm ARPy++ GRPy

16.2.2 MAC Operation and Load


Up to two load instructions, using register indirect addressing with post increment,
can be executed in parallel with any MAC unit operation (mpy[u][s], mac[sub][u][s],
rnd, sat, clracc, addacr, subacr, add32 or sub32).
If one load is parallel with a MAC operation, the destination register for the
load must be the same as the first operand register for the MAC operation, if it has
any operand registers (for example ’rnd’ has no operands so any register can be
loaded).
If two loads are parallel with a MAC operation the first load must use tm
and its destination must be the same as the first operand of the MAC operation.
The second load must use dm and its destination must be the same as the second
operand register of the MAC operation (if it has two operands). The destination
for the second load must also always be one of GRP16 - GRP31.
It is also possible to execute the two load operations in parallel without the
MAC operation. In that case any register can be used.
Example 1: mac GRPx GRPy ACCz , loadtm ARPx++ GRPx , loaddm
ARPy++ GRPy (where GRPy must be one of GRP16 - GRP31)
Example 2: loadtm ARPx++ GRPx , loaddm ARPy++ GRPy (GRPy can be
any register)
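
The way a MAC operation and two parallel loads work together in a convolution type loop can be modelled in Python. The sketch assumes that in each instruction the MAC reads the old register contents while the loads overwrite the same registers with the next operands; it counts issued instructions only and ignores the pipeline depth. All names are hypothetical.

```python
def dot_product(tm, dm):
    """Sum of products of tap memory and data memory, one MAC per instruction."""
    n = len(tm)
    acc = 0
    arp_x = arp_y = 0
    # loadtm ARPx++ GRPx , loaddm ARPy++ GRPy   (two loads, no MAC)
    grp_x, grp_y = tm[arp_x], dm[arp_y]
    arp_x += 1
    arp_y += 1
    issued = 1
    for _ in range(n - 1):
        # mac GRPx GRPy ACC0 , loadtm ARPx++ GRPx , loaddm ARPy++ GRPy
        acc += grp_x * grp_y                   # uses the previously loaded operands
        grp_x, grp_y = tm[arp_x], dm[arp_y]    # loads replace them for the next MAC
        arp_x += 1
        arp_y += 1
        issued += 1
    # mac GRPx GRPy ACC0   (last pair, nothing left to load)
    acc += grp_x * grp_y
    issued += 1
    return acc, issued


result, issued = dot_product([1, 2, 3], [4, 5, 6])
```

For n taps the model issues n + 1 instructions, which is the sense in which the parallel loads allow one multiply-and-accumulate operation per clock cycle.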

16.2.3 Arithmetic, Logic, Shift or Move Operation and Load


One load instruction, using register indirect addressing with post increment, can
be executed in parallel with any arithmetic, logic or shift operation using regis-
tered operands (or in other words: all arithmetic, logic and shift operations that
do not use immediate data) or with the ’move’ instruction. The destination for the
load must be the same as the first/source operand of the other operation.
Example: add GRPx GRPy , loaddm ARPx++ GRPx

16.3 Instruction Set Restrictions


Due to pipeline complications there are some restrictions on when some instructions
can be used. These are listed below.

16.3.1 Branch and Jump Instructions
This processor uses delayed jump, branch, subroutine call and subroutine return
instructions. In other words the two instructions following the jump/branch etc. are
always executed whether the jump is taken or not. If one of these two instructions
were also a jump instruction, or for example a loop instruction, that would cause
complications. Therefore ’jmp’, ’bra’, ’call’ and ’rts’ instructions must always be
followed by two instructions that are not program flow instructions.
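
A rule like this is easy to check mechanically. The following Python sketch scans a list of mnemonics and reports every delayed instruction whose two delay slots are missing or contain a program flow instruction. The program representation and the function name are assumptions; the set of program flow mnemonics is taken from the instruction encoding table, with ’bra’ standing for all conditional branches.

```python
FLOW = {"jmp", "bra", "call", "rts", "loop", "repeat"}
DELAYED = {"jmp", "bra", "call", "rts"}


def delay_slot_violations(program):
    """Return the indices of delayed instructions with invalid delay slots."""
    bad = []
    for i, op in enumerate(program):
        if op in DELAYED:
            slots = program[i + 1:i + 3]
            if len(slots) < 2 or any(s in FLOW for s in slots):
                bad.append(i)
    return bad
```

For example, the sequence call, add, rts is rejected because ’rts’ occupies the second delay slot of the ’call’.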

16.3.2 Hardware Loops


The ’repeat’ Instruction
The only restriction that applies to the ’repeat’ instruction is that the repeated
instruction may not be a program flow instruction.

The ’loop’ Instruction


The more complex ’loop’ instruction has the following restrictions:
1. The loop must consist of at least two instructions (otherwise use ’repeat’).
2. The two last instructions of the loop must not be program flow instructions.
3. Two nested loops may not end at the same address.
4. No more than four nested loops are allowed.

16.3.3 Modulo Addressing


The implementation of modulo addressing does not allow the address to “wrap
around” the modulo addressing area more than once. This results in the following
restrictions:
1. When using modulo addressing and post increment, the step size should not be
larger than the modulo addressing area (that is STEPx < BOTTOMx - TOPx
should hold).
2. When using modulo addressing with offset addressing the offset should not be
larger than the modulo addressing area.

16.4 Instruction Encoding
This section describes the machine code of every instruction. The letters repre-
senting different subfields in the table have the following meanings:

A = Address register
C = Constant data, address or offset
c = Condition
D = Destination Register
M = Memory use
P = Program address register
r = Round accumulator
S = Source Register
s = Saturate accumulator
Y = Accumulator register
X = Occupied
- = Don’t care

Mux OpCode Memory Addr1 SReg Addr2 DReg Instruction


(31:27) (26:21) (20:16) (15:13) (12:8) (7:5) (4:0)

Data move instructions:


00000 00000 0 0-0– — —– — —– nop
00000 00000 0 MMMMM AAA DDDDD AAA DDDDD loadtm ARPx++ GRPx , loaddm ARPy++ GRPy
00000 00000 1 MMMMM AAA DDDDD AAA DDDDD move2tm ARPx++ GRPx , move2dm ARPy++ GRPy
00000 00001 0 0-1MM AAA DDDDD CCC CCCCC loaddm (ARPx + #offset) GRPx
00000 00001 0 1M0– AAA DDDDD CCC CCCCC loadtm (ARPx + #offset) GRPx
00000 00001 1 0-1MM AAA DDDDD CCC CCCCC move2dm (ARPx + #offset) GRPx
00000 00001 1 1M0– AAA DDDDD CCC CCCCC move2tm (ARPx + #offset) GRPx
00000 01011 0 CCCCC CCC CCCCC CCC DDDDD loaddmi GRPx #addr
00000 01011 1 CCCCC CCC CCCCC CCC DDDDD movedmi GRPx #addr
ALU instructions, arithmetic unit:
00010 00000 0 0-0– — SSSSS — DDDDD abs GRPx GRPy
00010 00000 0 0-1MM AAA SSSSS — DDDDD abs GRPx GRPy , loaddm ARPx++ GRPx
00010 00000 0 1M0– AAA SSSSS — DDDDD abs GRPx GRPy , loadtm ARPx++ GRPx
00010 00001 0 MMMMM AAA SSSSS — DDDDD add GRPx GRPy
00010 00010 0 MMMMM AAA SSSSS — DDDDD addc GRPx GRPy
00010 00011 0 CCCCC CCC CCCCC CCC DDDDD addi GRPx GRPy
00010 00100 0 MMMMM AAA SSSSS — DDDDD avg GRPx GRPy
00010 00101 0 MMMMM AAA SSSSS — SSSSS comp GRPx GRPy
00010 00110 0 MMMMM AAA SSSSS — DDDDD neg GRPx GRPy
00010 00111 0 MMMMM AAA SSSSS — DDDDD sub GRPx GRPy
00010 01000 0 MMMMM AAA SSSSS — DDDDD subc GRPx GRPy
00010 01001 0 CCCCC CCC CCCCC CCC DDDDD subi GRPx GRPy
00010 01010 0 0-0– — SSSSS — DDDDD move GRPx GRPy
00010 01010 0 1M0– AAA SSSSS — DDDDD move GRPx GRPy , loadtm ARPx++ GRPx
00010 01010 0 0-1MM AAA SSSSS — DDDDD move GRPx GRPy , loaddm ARPx++ GRPx

72
00010 01011 0 0-0– — SSSS- — DDDD- move32 GRPx GRPy
00010 01100 0 CCCCC CCC CCCCC CCC DDDDD load #data GRPy
ALU instructions, logic unit:
00100 00000 0 MMMMM AAA SSSSS — DDDDD and GRPx GRPy
00100 00001 0 CCCCC CCC CCCCC CCC DDDDD andi GRPx GRPy
00100 00010 0 MMMMM AAA SSSSS — DDDDD or GRPx GRPy
00100 00011 0 CCCCC CCC CCCCC CCC DDDDD ori GRPx GRPy
00100 00100 0 MMMMM AAA SSSSS — DDDDD xor GRPx GRPy
00100 00101 0 CCCCC CCC CCCCC CCC DDDDD xori GRPx GRPy
00100 00110 0 MMMMM AAA SSSSS — DDDDD not GRPx GRPy
ALU instructions, shift unit:
00110 00000 0 MMMMM AAA SSSSS — DDDDD asl GRPx GRPy
00110 00001 0 —– — —CC CCC DDDDD asli GRPx GRPy
00110 00010 0 MMMMM AAA SSSSS — DDDDD lsl GRPx GRPy
00110 00011 0 —– — —CC CCC DDDDD lsli GRPx GRPy
00110 00100 0 MMMMM AAA SSSSS — DDDDD rsl GRPx GRPy
00110 00101 0 —– — —CC CCC DDDDD rsli GRPx GRPy
00110 00110 0 MMMMM AAA SSSSS — DDDDD rslc GRPx GRPy
00110 00111 0 —– — —CC CCC DDDDD rslci GRPx GRPy
MAC instructions:
01000 00000 0 MMMMM AAA SSSSS AAA YSSSS mpy GRPx GRPy ACCx
01010 00000 0 MMMMM AAA SSSSS AAA YSSSS mpy GRPx GRPy ACCx SAT
01001 00000 0 MMMMM AAA SSSSS AAA YSSSS mpy GRPx GRPy ACCx RND
010sr 00001 0 MMMMM AAA SSSSS AAA YSSSS mpyu GRPx GRPy ACCx [SAT/RND]
010sr 00010 0 MMMMM AAA SSSSS AAA YSSSS mpysu GRPx GRPy ACCx [SAT/RND]
010sr 00011 0 MMMMM AAA SSSSS AAA YSSSS mpyus GRPx GRPy ACCx [SAT/RND]
010s0 00100 0 MMMMM AAA SSSSS AAA YSSSS mac GRPx GRPy ACCx [SAT]
010s0 00101 0 MMMMM AAA SSSSS AAA YSSSS macu GRPx GRPy ACCx [SAT]
010s0 00110 0 MMMMM AAA SSSSS AAA YSSSS macsu GRPx GRPy ACCx [SAT]
010s0 00111 0 MMMMM AAA SSSSS AAA YSSSS macus GRPx GRPy ACCx [SAT]
010s0 01000 0 MMMMM AAA SSSSS AAA YSSSS mac GRPx GRPy ACCx [SAT]
010s0 01001 0 MMMMM AAA SSSSS AAA YSSSS macu GRPx GRPy ACCx [SAT]
010s0 01010 0 MMMMM AAA SSSSS AAA YSSSS macsu GRPx GRPy ACCx [SAT]
010s0 01011 0 MMMMM AAA SSSSS AAA YSSSS macus GRPx GRPy ACCx [SAT]
010s1 01100 0 MMMMM AAA DDDDD AAA YDDDD rnd ACCx [SAT]
01010 01100 0 MMMMM AAA DDDDD AAA YDDDD sat ACCx
01000 01101 0 MMMMM AAA DDDDD AAA YDDDD clracc ACCx
010sr 01110 0 MMMMM AAA SSSSS AAA YDDDD addacc GRPx ACCx [SAT]
010sr 01111 0 MMMMM AAA SSSSS AAA YDDDD subacc GRPx ACCx [SAT]
010sr 10000 0 MMMMM AAA SSSSS AAA YDDDD add32 GRPx ACCx [SAT]
010sr 10001 0 MMMMM AAA SSSSS AAA YDDDD sub32 GRPx ACCx [SAT]
010sr 10010 0 MMMMM AAA SSSSS AAA Y---- lshl GRPx ACCx [SAT/RND]
010sr 10011 - ----- --- --CCC CCC Y---- lshli GRPx ACCx [SAT/RND]
Program flow instructions:
01100 00000 0 cccc- --- ----- --- PPPPP bracond GRPx
01100 00001 0 cccc- CCC CCCCC CCC CCCCC bracond #addr
01100 00010 0 ----- --- ----- --- PPPPP jmp GRPx
01100 00011 0 ----- CCC CCCCC CCC CCCCC jmp #addr
01100 00100 0 ----- --- ----- --- PPPPP call GRPx
01100 00101 0 ----- CCC CCCCC CCC CCCCC call #addr
01100 00111 0 ----- CCC CCCCC CCC CCCCC loop #addr
01100 01001 0 ----- --- CCCCC CCC ----- repeat #data
01100 01010 0 ----- --- ----- --- ----- rts
Accelerator instructions:
1XXXXXXXXX XXXXXX XXX XXXXX XXX XXXXX Accelerator instructions.
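
The column layout of the table can be read as bit fields of a 32-bit instruction word. The following is a minimal Python sketch of that reading, not part of the processor tool chain. It assumes that the leftmost column holds the most significant bits and that the columns are 5, 5, 1, 5, 3, 5, 3 and 5 bits wide. The function name decode_fields and the field names are hypothetical, chosen only for this illustration. The four middle columns are treated as one 16-bit immediate, as in the rows marked CCCCC CCC CCCCC CCC.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit instruction word into the column groups of the table.

    Assumption: the leftmost table column is the most significant field.
    """
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must fit in 32 bits")
    return {
        "group": (word >> 27) & 0x1F,   # first 5-bit column (instruction group)
        "opcode": (word >> 22) & 0x1F,  # second 5-bit column
        "flag": (word >> 21) & 0x1,     # single-bit column
        "imm16": (word >> 5) & 0xFFFF,  # the four middle columns as one immediate
        "dest": word & 0x1F,            # last 5-bit column (DDDDD)
    }


# Hypothetical encoding of "andi #h00FF GRP3" built from the andi row.
word = (0b00100 << 27) | (0b00001 << 22) | (0x00FF << 5) | 3
fields = decode_fields(word)
print(fields)
```

For register-direct rows the same 16 bits would instead be split into the MMMMM, AAA and SSSSS columns; that split is left out here because the meaning of those fields is defined elsewhere in the specification.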

Bibliography

[1] Dake Liu, Design an embedded digital signal processor, LiTH.

[2] Phil Lapsley, Jeff Bier, Amit Shoham, Edward A. Lee, DSP Processor Fundamentals, IEEE Press, 1995.

[3] Berkeley Design Technology, Inc, Evaluating DSP Processor Performance, 2000.

[4] David A. Patterson, John L. Hennessy, Computer Organization & Design - the hardware/software interface (second edition), Morgan Kaufmann, 1998.

[5] Lars Wanhammar, DSP Integrated Circuits, Academic Press, 1999.

Appendix A

Instruction set summary

This chapter gives complete descriptions of all instructions of the processor.
Each description includes:
Type of instruction - Instruction group, short description.
Syntax - What the assembly code looks like, addressing modes.
Operands - What data the instruction can use.
Execution - What the instruction does ("mathematically").
Description - Description of the behaviour of the instruction, and which registers
and status flags are affected.
Example - A short example of use of the instruction.

abs
Type of instruction
Arithmetic instruction - absolute value

Syntax
Register direct: abs GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
|GRPx| → GRPy

Description
The absolute value of register GRPx is stored in register GRPy. The flags are not
updated.

Example
abs GRPx, GRPy

Register/Memory Before After


GRPx hFF12 hFF12
GRPy h0020 h00EE

add
Type of instruction
Arithmetic instruction - addition.

Syntax
Register direct without carry: add GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPx + GRPy → GRPy

Description
The values in register GRPx and GRPy are added and the result is stored in regis-
ter GRPy. The flags N, Z, C and O are updated. C is set when unsigned addition
generates carry. O is set when signed addition generates overflow.

Example
add GRPx, GRPy

Register/Memory Before After


Status Reg b0010 b0000
GRPx h0012 h0012
GRPy h0020 h0032
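
The flag rules stated for add (and for addc, which adds the C flag as carry in) can be illustrated with a small Python model. This is a sketch under assumptions, not the RTL: the function name add16 and the dictionary of flags are hypothetical, registers are taken to be 16 bits wide as the examples suggest, and N is assumed to be the sign bit of the result.

```python
def add16(x: int, y: int, carry_in: int = 0):
    """Model of a 16-bit addition returning (result, flags)."""
    x &= 0xFFFF
    y &= 0xFFFF
    total = x + y + (carry_in & 1)
    result = total & 0xFFFF
    flags = {
        "N": result >> 15,                 # sign bit of the result
        "Z": int(result == 0),             # result is zero
        "C": int(total > 0xFFFF),          # unsigned addition generated carry
        # signed overflow: both operands differ in sign from the result
        "O": int(((x ^ result) & (y ^ result) & 0x8000) != 0),
    }
    return result, flags


print(add16(0x0012, 0x0020))              # values from the add example
print(add16(0x0012, 0x0020, carry_in=1))  # values from the addc example
```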

addc
Type of instruction
Arithmetic instruction - addition with carry in.

Syntax
Register direct with carry: addc GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPx + GRPy + C → GRPy

Description
The values in register GRPx and GRPy and the value of the flag C are added and
the result is stored in register GRPy. The flags N, Z, C and O are updated. C is set
when unsigned addition generates carry. O is set when signed addition generates
overflow.

Example
addc GRPx, GRPy

Register/Memory Before After


Status Reg b0010 b0000
GRPx h0012 h0012
GRPy h0020 h0033

addi
Type of instruction
Arithmetic instruction - addition with immediate data

Syntax
Immediate data without carry: addi #Data GRPy

Operands
h8000 ≤ Data ≤ h7FFF

GRPy: GRP0 - GRP31

Execution
Data + GRPy → GRPy

Description
The value in register GRPy and the Data value are added. The result is stored
in register GRPy. The flags N, Z, C and O are updated. C is set when unsigned
addition generates carry. O is set when signed addition generates overflow.

Example
addi #h1234 GRPy

Register/Memory Before After


GRPy h0020 h1254

and
Type of instruction
Logic instruction - bitwise and.

Syntax
Register direct: and GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPx & GRPy → GRPy

Description
Bitwise and between the values in register GRPx and GRPy. The result is stored
in register GRPy. The flags N and Z are updated.

Example
and GRPx GRPy

Register/Memory Before After


Status Reg b1010 b0010
GRPx h0012 h0012
GRPy h8010 h0010

andi
Type of instruction
Logic instruction - bitwise and with immediate data

Syntax
Immediate data: andi #Data GRPy

Operands
h8000 ≤ Data ≤ h7FFF

GRPy: GRP0 - GRP31

Execution
Data & GRPy → GRPy

Description
Bitwise and between the value in register GRPy and the Data. The result is stored
in register GRPy. The flags N and Z are updated.

Example
andi #hFF GRPy

Register/Memory Before After


Status Reg b1000 b0000
GRPy h8020 h0020

asl
Type of instruction
Shift instruction - arithmetic shift

Syntax
Register direct: asl GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy << (GRPx & h001F) → GRPy

Description
If the value in GRPx is positive the value in register GRPy is shifted GRPx steps
to the left. If the value in GRPx is negative the value in GRPy is arithmetically
shifted -GRPx steps to the right. The result is stored in register GRPy. The flags
N, Z, C and O are updated. O is set if overflow occurs on a left shift. C is the last
bit shifted out on a right shift.

Example
asl GRPx GRPy

Register/Memory Before After


Status Reg b0001 b0010
GRPx h0013 h0013
GRPy h9F22 hFFFC
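
The example above is consistent with reading the five least significant bits of GRPx as a two's complement shift count (h0013 then means -13, a right shift of 13 steps). The Python sketch below models that reading. It is an interpretation inferred from the example, not a statement of the hardware implementation; the name asl16 is hypothetical and flag updates are left out.

```python
def asl16(value: int, count_reg: int) -> int:
    """Arithmetic shift of a 16-bit value by a signed 5-bit count.

    Positive counts shift left, negative counts shift right with sign extension.
    """
    count = count_reg & 0x1F
    if count & 0x10:              # sign bit of the 5-bit count field
        count -= 0x20
    signed = value & 0xFFFF
    if signed & 0x8000:           # interpret the register as a signed number
        signed -= 0x10000
    if count >= 0:
        return (signed << count) & 0xFFFF
    return (signed >> -count) & 0xFFFF   # Python's >> on ints is arithmetic


print(hex(asl16(0x9F22, 0x0013)))  # asl example
print(hex(asl16(0xFF22, -4)))      # asli #-4 example
```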

asli
Type of instruction
Shift instruction - arithmetic shift with immediate data.

Syntax
Immediate data: asli #Step, GRPy

Operands
-15 ≤ Step ≤ 15
GRPy: GRP0 - GRP31

Execution
GRPy << Step → GRPy

Description
If the value Step is positive the value in register GRPy is shifted Step steps to the
left. If Step is negative the value in GRPy is arithmetically shifted
-Step steps to the right. The result is stored in register GRPy. The flags N, Z,
C and O are updated. O is set if overflow occurs on a left shift. C is the last bit
shifted out on a right shift.

Example
asli #-4 GRPy

Register/Memory Before After


Status Reg b1000 b1000
GRPy hff22 hfff2

avg
Type of instruction
Arithmetic instruction - average value.

Syntax
Register direct: avg GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
(GRPx + GRPy) / 2 → GRPy

Description
The average value of the value in register GRPx and in register GRPy is stored in
register GRPy. The flags N and Z are updated.

Example
avg GRPx GRPy

Register/Memory Before After


Status Reg b0000 b0000
GRPx h0023 h0023
GRPy h0020 h0021

bra{cond}
Type of instruction
Program flow instruction - conditional jump.

Syntax
Register direct: bra{cond} GRPx
Immediate PC address: bra{cond} #Addr

Operands
h0000 ≤ Addr ≤ hFFFF

GRPx: GRP0 - GRP31

Execution
if {cond} is TRUE

GRPx → PC

or

Addr → PC

else

PC + 1 → PC

Description
A conditional branch, either register or constant based.
The jump is delayed two cycles, that is the two instructions following the branch
instruction are executed whether the branch is taken or not. None of the two
following instructions may be bra, call, rts, loop or repeat instructions. Bra may not
be used as a repeat instruction or as one of the two last instructions in a hardware
loop. No flags are updated.

{cond} Relation Flag status
gt GRPx > GRPy Z=0 and N=0
ge GRPx ≥ GRPy N=0
lt GRPx < GRPy N=1
le GRPx ≤ GRPy Z=1 or N=1
eq GRPx = GRPy Z=1
ne GRPx ≠ GRPy Z=0
c carry C=1
nc not carry C=0
o overflow O=1
no not overflow O=0
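
The condition codes can be expressed as predicates over the status flags. The sketch below is a Python illustration only; the names CONDITIONS and next_pc are hypothetical, the two delay slots are ignored, and le is modelled as "lt or eq" (Z=1 or N=1), which is an assumption about the intended meaning of that row.

```python
# Each condition maps to a predicate over a dict of flags {"N", "Z", "C", "O"}.
CONDITIONS = {
    "gt": lambda f: f["Z"] == 0 and f["N"] == 0,
    "ge": lambda f: f["N"] == 0,
    "lt": lambda f: f["N"] == 1,
    "le": lambda f: f["Z"] == 1 or f["N"] == 1,   # assumed: lt or eq
    "eq": lambda f: f["Z"] == 1,
    "ne": lambda f: f["Z"] == 0,
    "c": lambda f: f["C"] == 1,
    "nc": lambda f: f["C"] == 0,
    "o": lambda f: f["O"] == 1,
    "no": lambda f: f["O"] == 0,
}


def next_pc(cond: str, flags: dict, pc: int, target: int) -> int:
    """Return the branch target if the condition holds, otherwise PC + 1."""
    return target if CONDITIONS[cond](flags) else pc + 1


cleared = {"N": 0, "Z": 0, "C": 0, "O": 0}
print(hex(next_pc("gt", cleared, 0x0100, 0x0030)))  # branch taken
print(hex(next_pc("eq", cleared, 0x0012, 0x0010)))  # branch not taken
```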

Example
bragt #h30

Register/Memory Before After


Status Reg b0000 b0000
PC h0100 h0030

braeq GRPx

Register/Memory Before After


Status Reg b0000 b0000
GRPx h0010 h0010
PC h0012 h0013

call
Type of instruction
Program flow instruction - subroutine jump

Syntax
Immediate PC address: call #Addr
Register direct: call GRPx

Operands
h0000 ≤ Addr ≤ h7FFF

Execution
PC → PC stack

Addr → PC

Description
A subroutine jump

The jump is delayed two cycles, that is the two instructions following the call
instruction are executed before the jump is taken. None of the two following in-
structions may be bra, call, rts, loop or repeat instructions. Call may not be used
as a repeat instruction or as one of the two last instructions in a hardware loop.
No flags are updated.

Example
call #h3000

Register/Memory Before After


PC h0012 h3000

clracc
Type of instruction
MAC instruction - clear accumulator

Syntax
Register direct: clracc ACCx

Operands
ACCx: ACC0, ACC1

Execution
0 → ACCx

Description
Clear accumulator register.
No flags are updated.

Example
clracc ACC0

Register/Memory Before After


ACC0 hxxxxxxxxxx h0000000000

comp
Type of instruction
Arithmetic instruction - compare two values.

Syntax
Register direct: comp GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy - GRPx → None

Description
The value in register GRPx is subtracted from the value in register GRPy, but the
result is not stored. The flags are updated. C is set when unsigned subtraction
does not generate borrow. O is set when signed subtraction generates overflow.

Example
comp GRPx GRPy

Register/Memory Before After


Status Reg b0000 b0100
GRPx hFF12 hFF12
GRPy hFF12 hFF12

jmp
Type of instruction
Program flow instruction - jump

Syntax
Register direct: jmp GRPx
Immediate PC address: jmp #Addr

Operands
h0000 ≤ Addr ≤ hFFFF

Execution
Addr → PC

Description
Unconditional jump.
The jump is delayed two cycles, that is the two instructions following the jmp
instruction are executed before the jump is taken. None of the two following in-
structions may be bra, call, rts, loop or repeat instructions. Jmp may not be used
as a repeat instruction or as one of the two last instructions in a hardware loop.
No flags are updated.

Example
jmp #h3000

Register/Memory Before After


PC h0012 h3000

load
Type of instruction
Data move instruction. Load register with immediate data

Syntax
Immediate data: load #Const GRPy

Operands
h8000 ≤ Const ≤ h7FFF

GRPy: GRP0 - GRP31

Execution
Const → GRPy

Description
The constant, Const, is loaded into the register GRPy. No flags are updated.

Example
loadi #h2034 GRPy

Register/Memory Before After


GRPy h1210 h2034

loadtm, load3m, loaddm, loaddm0, loaddm1,
loaddm2, loaddm3
Type of instruction
Memory instruction - Load register from memory

Syntax
Register indirect with postincrement: loadXmX ARPx++ GRPy
Register indirect with offset address: loadXmX ARPx #Offset GRPy

Operands
h0 ≤ Offset ≤ hFF

ARPx: ARP0 - ARP8

GRPy: GRP0 - GRP31

Execution
DM0(#Addr) → GRPy

Description
The value stored at the address ARPx or ARPx + Offset in the specified memory
is copied to register GRPy. loaddm is equivalent to loaddm0. The memory bits
decide which memory is used.
If addressing with postincrement is used, ARPx is increased by the value in the
STEPx register.
The flags N and Z are updated.

Example
loaddm ARPx++, GRPy

Register/Memory Before After


ARPx h0200 h0201
GRPy hFF12 h1234
DM(h0200) h1234 h1234

loaddm ARPx #h2, GRPy

Register/Memory Before After
ARPx h0200 h0200
GRPy hFF12 h2222
DM(h0200) h0000 h0000
DM(h0201) h1111 h1111
DM(h0202) h2222 h2222
DM(h0203) h3333 h3333
DM(h0204) h4444 h4444

loaddmi
Type of instruction
Memory instruction - Load register, immediate address

Syntax
Immediate address: loaddmi #Addr GRPx

Operands
h0 ≤ Offset ≤ hFF

ARPx: ARP0 - ARP8

GRPy: GRP0 - GRP15

Execution
DM0(#Addr) → GRPx

Description
The value stored at the address #Addr in dm0 is copied to the register GRPx. The
flags are not updated. Note that there is no equivalent function for any other mem-
ories than dm0.

Example
loaddmi #hFF00 GRPy

Register/Memory Before After


GRPy hFF12 h1234
DM(hFF00) h1234 h1234

loop
Type of instruction
Program flow instruction - hardware loop

Syntax
Immediate PC address: loop #Addr

Operands
h0000 ≤ Addr ≤ hFFFF

LOOP register

Execution
PC + 1 → Loop start stack

Addr → Loop end stack

LOOP register → Loop counter stack

Description
The instructions between the loop instruction and the PC address Addr (including
that address) are repeated the number of times specified by the value in the LOOP
register.
Up to four nested loops are possible, however two loops may never end at the
same address. The loop must have at least two instructions (otherwise repeat is
used) and the last two instructions in a loop must not be jmp, bra, call or repeat.

Example
loadi #30 LOOP
loop #2000

Instructions from program addresses PC+1 to 2000 will be looped 30 times

lshl
Type of instruction
MAC instruction - 32-bit shift

Syntax
Register direct: lshl GRPx ACCy

Operands
GRPx: GRP0 - GRP31
ACCy: ACC0, ACC1

Execution
ACCy << (GRPx & h3F) → ACCy

Description
If the value in GRPx is positive the value in accumulator ACCy is shifted GRPx
steps to the left. If the value in GRPx is negative the value in ACCy is
arithmetically shifted -GRPx steps to the right. The result is stored in register ACCy.
The flags N, Z, C and O are updated. O is set if overflow occurs on a left shift. C is
the last bit shifted out on a right shift.

Example
lshl GRPx ACCy

Register/Memory Before After


GRPx h0008 h0008
ACCy hxx0000FF22 hxx00FF2200

lshli
Type of instruction
MAC instruction. 32-bit shift with immediate data

Syntax
Immediate data: lshli #Steps ACCy

Operands
-32 ≤ Steps ≤ 31
ACCy: ACC0, ACC1

Execution
ACCy << Steps → ACCy

Description
If the value Steps is positive the value in accumulator ACCy is shifted Steps steps
to the left. If Steps is negative the value in ACCy is arithmetically shifted -Steps
steps to the right. The result is stored in register ACCy. The flags N, Z, C and O
are updated. O is set if overflow occurs on a left shift. C is the last bit shifted out
on a right shift.

Example
lshli #12 ACCy

Register/Memory Before After


ACCy hxx0000FF22 hxx0FF22000

lsl
Type of instruction
Shift instruction - logical shift.

Syntax
Register direct: lsl GRPx, GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy << (GRPx & h1F) → GRPy

Description
If the value in GRPx is positive the value in register GRPy is shifted GRPx steps
to the left. If the value in GRPx is negative the value in GRPy is logically shifted
-GRPx steps to the right. The result is stored in register GRPy. The flags N, Z,
C and O are updated. O is set if overflow occurs on a left shift. C is the last bit
shifted out on a right shift.

Example
lsl GRPx GRPy

Register/Memory Before After


Status Reg b1000 b0010
GRPx hFFFE hFFFE
GRPy hFF22 h3FC8

lsli
Type of instruction
Shift instruction - logical shift with immediate data.

Syntax
Immediate data: lsli #Step, GRPy

Operands
-15 ≤ Step ≤ 15
GRPy: GRP0 - GRP31

Execution
GRPy << Step → GRPy

Description
If the value Step is positive the value in register GRPy is shifted Step steps to the
left. If Step is negative the value in GRPy is logically shifted -Step
steps to the right. The result is stored in register GRPy. The flags N, Z, C and O
are updated. O is set if overflow occurs on a left shift. C is the last bit shifted out
on a right shift.

Example
lsli #4 GRPy

Register/Memory Before After


Status Reg b0000 b1000
GRPy hFF22 hF220

mac, macu, macus, macsu, macsub, macsubu,
macsubus, macsubsu
Type of instruction
Mac instruction - multiply and accumulate

Syntax
Register direct: mac[sub][u/su/us] GRPx GRPy ACCz [SAT]

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
ACCz: ACC0, ACC1

Execution
ACCz + GRPx × GRPy → ACCz

ACCz - GRPx × GRPy → ACCz

Description
The value of register GRPx is multiplied by the value of register GRPy and the
product is added to (mac, macu, macus, macsu) or subtracted from (macsub,
macsubu, macsubus, macsubsu) the accumulator ACCz. mac and macsub execute
a signed multiplication and macu and macsubu an unsigned multiplication.
macus/macsu and macsubus/macsubsu consider the first or the second operand to be
unsigned respectively. If SAT is added the accumulator will be saturated after the
accumulation. The status flags N, Z and O are updated.

Example
mac GRPx GRPy ACC0

Register/Memory Before After


GRPx h0002 h0002
GRPy h0003 h0003
ACC0 h0000001000 h0000001006
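
The multiply-accumulate behaviour can be sketched in Python as follows. This is an illustration under assumptions: the accumulator is taken to be 40 bits wide (inferred from the ten hex digits in the examples), the operands are treated as plain integers (the fractional mode is not modelled), saturation clamps to the signed 32-bit range as described for the sat instruction, and flags are not modelled. The names mac and to_signed and the keyword arguments are hypothetical.

```python
ACC_BITS = 40  # assumption: accumulator width inferred from the examples


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, x, y, subtract=False, x_unsigned=False, y_unsigned=False, sat=False):
    """Return the new accumulator value after acc +/- x * y."""
    a = (x & 0xFFFF) if x_unsigned else to_signed(x, 16)
    b = (y & 0xFFFF) if y_unsigned else to_signed(y, 16)
    product = a * b
    total = to_signed(acc, ACC_BITS) + (-product if subtract else product)
    if sat:
        # clamp to the signed 32-bit range
        total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & ((1 << ACC_BITS) - 1)


print(hex(mac(0x0000001000, 0x0002, 0x0003)))  # values from the mac example
```

Passing subtract=True models the macsub variants, and the two unsigned switches model the u, us and su variants without committing to which suffix maps to which operand.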

move
Type of instruction
Data move instruction - move between registers

Syntax
Register direct: move GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPx → GRPy

Description
The value of register GRPx is copied to register GRPy. No flags are updated.

Example
move GRPx GRPy

Register/Memory Before After


GRPx hFF12 hFF12
GRPy h1010 hFF12

move2tm, move23m, move2dm, move2dm#
Type of instruction
Memory instruction - write to memory

Syntax
Register indirect with postincrement: move2{tm/3m/dm[x]} ARPx++ GRPx
Register indirect with offset address: move2{tm/3m/dm[x]} ARPx #Offset GRPx

Operands
h0 ≤ Offset ≤ hFF

ARPx: ARP0 - ARP7

GRPx: GRP0 - GRP31

Execution
GRPx → DM(ARPx)
GRPx → DM(ARPx + Offset)

Description
The value of register GRPx is copied to the specified memory address. No flags are
updated.

Example
move2dm ARPx++ GRPx

Register/Memory Before After


GRPx hff12 hff12
ARPx h0200 h0201
DM(h0200) h1234 hff12

move2dm ARPx #h2 GRPx

Register/Memory Before After


GRPx hFF12 hFF12
ARPx h0200 h0200

DM(h0200) h1234 h1234
DM(h0201) h1234 h1234
DM(h0202) h1234 hFF12
DM(h0203) h1234 h1234
DM(h0204) h1234 h1234

movedmi
Type of instruction
Memory instruction - write to memory, immediate address

Syntax
Address direct: movedmi #Addr GRPx

Operands
h0 ≤ Offset ≤ hFF

ARPx: ARP0 - ARP8

GRPy: GRP0 - GRP15

Execution
GRPx → DM0(Addr)

Description
The value stored in register GRPx is copied to the address Addr in dm0. The flags
are not updated. Note that there is no equivalent function for any other memories
than dm0.

Example
movedmi #hFF00 GRPy

Register/Memory Before After


GRPy hFF12 hFF12
DM(hFF00) h1234 hFF12

mpy, mpyu, mpyus, mpysu
Type of instruction
Mac instruction - multiplication

Syntax
Register direct: mpy[u/su/us] GRPx GRPy ACCz [SAT/RND]

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
ACCz: ACC0, ACC1

Execution
GRPx × GRPy → ACCz

Description
The value of register GRPx is multiplied by the value of register GRPy and the
product is placed in the accumulator ACCz. mpy executes a signed multiplication
and mpyu an unsigned multiplication. mpyus/mpysu consider the first or the
second operand to be unsigned respectively. If SAT is added the result will be
saturated and if RND is added the result will be rounded. The status flags N and
Z are updated.

Example
mpy GRPx GRPy ACC0

Register/Memory Before After


GRPx h0002 h0002
GRPy h0003 h0003
ACC0 h1000 h0006

neg
Type of instruction
Arithmetic instruction - negate value.

Syntax
Register direct: neg GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
-GRPx → GRPy

Description
The value in register GRPx is negated and stored in register GRPy. The flags are
not updated.

Example
neg GRPx GRPy

Register/Memory Before After


GRPx h0012 h0012
GRPy h0020 hFFEE

nop
Type of instruction
Program flow instruction - no operation

Syntax
No operands: nop

Operands

Execution
PC + 1 → PC

Description
This instruction only affects the PC and is used to create execution delays.

Example
nop

Register/Memory Before After


PC h8020 h8021

not
Type of instruction
Logic instruction - invert register.

Syntax
Register direct: not GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
inv(GRPx) → GRPy

Description
The value in register GRPx is inverted bitwise and stored in register GRPy. The
flags are not updated.

Example
not GRPx GRPy

Register/Memory Before After


GRPx h0012 h0012
GRPy h0020 hFFED

or
Type of instruction
Logic instruction - bitwise or.

Syntax
Register direct: or GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPx | GRPy → GRPy

Description
Bitwise or between the values in register GRPx and GRPy. The result is stored in
register GRPy. The flags N and Z are updated.

Example
or GRPx GRPy

Register/Memory Before After


Status Reg b0000 b1000
GRPx h0012 h0012
GRPy h8020 h8032

ori
Type of instruction
Logic instruction - bitwise or with immediate data.

Syntax
Immediate data: ori #Data GRPy

Operands
h8000 ≤ Data ≤ h7FFF

GRPy: GRP0 - GRP31

Execution
Data | GRPy → GRPy

Description
Bitwise or between the value in register GRPy and the value Data. The result is
stored in register GRPy. The flags N and Z are updated.

Example
ori #h1111 GRPy

Register/Memory Before After


Status Reg b1000 b1000
GRPy h8020 h9131

repeat
Type of instruction
Program flow instruction - repeat instruction

Syntax
Immediate data: repeat #Data

Operands
0 ≤ Data ≤ 255

Execution
Data → RepeatReg

Description
The instruction following the repeat instruction is repeated Data number of times
before the PC is incremented again. The flags are not updated.

Example
repeat #20

Register/Memory Before After


Repeat Reg h0000 h0014

round
Type of instruction
MAC instruction - round.

Syntax
Register direct: round ACCx [SAT]

Operands
ACCx: ACC0, ACC1

Execution
if ACCx low is greater than h8000 then ACCx + h8000 → ACCx

Description
Rounds the accumulator register.
If bit 15 of ACCx is a '1', h8000 is added to ACCx.
If SAT is added the result will be saturated after rounding.
The flags N, Z and O are updated.

Example
rnd ACC0

Register/Memory Before After


ACC0 h0000109011 h0000119011

rsl
Type of instruction
Shift instruction - rotational shift.

Syntax
Register direct: rsl GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy << (GRPx & h1F) → GRPy

Description
The value in register GRPy is rotated GRPx steps to the left without carry between
msb and lsb. The result is stored in register GRPy. The flags N and Z are updated.
Negative value in GRPx results in right rotation.

Example
rsl GRPx GRPy

Register/Memory Before After


Status Reg b0000 b1000
GRPx h0002 h0002
GRPy h2222 h8888

rsli
Type of instruction
Shift instruction - Rotational shift with immediate data.

Syntax
Immediate data: rsli #Step GRPy

Operands
-15 ≤ Step ≤ 15
GRPy: GRP0 - GRP31

Execution
GRPy << Step → GRPy

Description
The value in register GRPy is rotated Step steps to the left without carry between
msb and lsb. Negative Step gives rotation to the right. The result is stored in
register GRPy. The flags N and Z are updated.

Example
rsli #4, GRPy

Register/Memory Before After


Status Reg b1000 b0000
GRPy hF222 h222F

rslc
Type of instruction
Shift instruction - Rotation with intermediate carry.

Syntax
Register direct: rslc GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy << (GRPx & hF) → GRPy

Description
The value in register GRPy is rotated GRPx steps to the left with carry storage
between msb and lsb. A negative value in GRPx gives rotation to the right. The
result is stored in register GRPy. The flags N, Z and C are updated.

Example
rslc GRPx, GRPy

Register/Memory Before After


Status Reg b0010 b0010
GRPx h0003 h0003
GRPy h2222 h1114
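
Rotation "with carry storage between msb and lsb" can be read as a 17-bit rotation in which the C flag sits between the most and least significant bits. The Python sketch below models left rotation under that reading; it reproduces the register values of the rslc and rslci examples. The name rslc16 is hypothetical, and right rotation (negative step counts) is not modelled.

```python
def rslc16(value: int, carry: int, steps: int):
    """Rotate a 16-bit value left through the carry flag, one bit per step.

    Returns (new_value, new_carry).
    """
    value &= 0xFFFF
    carry &= 1
    for _ in range(steps):
        msb = value >> 15                      # bit that leaves the register
        value = ((value << 1) | carry) & 0xFFFF  # old carry enters at the lsb
        carry = msb                            # msb becomes the new carry
    return value, carry


print(tuple(hex(v) for v in rslc16(0x2222, 1, 3)))  # rslc example, C set before
print(tuple(hex(v) for v in rslc16(0xF222, 0, 4)))  # rslci example, C clear before
```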

rslci
Type of instruction
Shift instruction - Rotation with intermediate carry and immediate data.

Syntax
Immediate data: rslci #Step GRPy

Operands
-15 ≤ Step ≤ 15
GRPy: GRP0 - GRP31

Execution
GRPy << Step → GRPy

Description
The value in register GRPy is rotated Step steps to the left with carry storage
between msb and lsb. The result is stored in register GRPy. The flags N, Z and C
are updated.

Example
rslci #4 GRPy

Register/Memory Before After


Status Reg b1000 b0010
GRPy hF222 h2227

rts
Type of instruction
Program flow instruction - return from subroutine.

Syntax
No operands: rts

Operands

Execution
PC stack → PC

Description
This instruction jumps back from the subroutine and restores the PC value.
The jump is delayed two cycles, that is the two instructions following the rts
instruction are executed before the jump is taken. None of the two following in-
structions may be bra, call, rts, loop or repeat instructions. rts may not be used as
a repeat instruction or as one of the two last instructions in a hardware loop.
No flags are updated.

Example
rts

Register/Memory Before After


PC stack top h0008 hxxxx
PC h0200 h0008

sat
Type of instruction
MAC instruction - saturate.

Syntax
Register direct: sat ACCx

Operands
ACCx: ACC0, ACC1

Execution
sat(ACCx) → ACCx

Description
Saturate accumulator register.
If the value of ACCx cannot be represented with 32 bits, ACCx will be set to
h007FFFFFFF or hFF80000000 depending on the sign of ACCx. Otherwise
the value will be kept.
Flag O is set if ACCx was larger than 32 bits.

Example
sat ACC0

Register/Memory Before After


Ex1: ACC0 h03xxxxxxxx h007FFFFFFF

Ex2: ACC0 hF3xxxxxxxx hFF80000000
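
A possible model of the saturation step, written as a Python sketch. It assumes a 40-bit accumulator (as suggested by the ten hex digits in the examples) that is clamped to the signed 32-bit range; the function name sat40_to_32 is hypothetical and the returned boolean stands in for the O flag.

```python
def sat40_to_32(acc: int):
    """Clamp a 40-bit two's complement accumulator to the signed 32-bit range.

    Returns (new_acc, overflow) where overflow models the O flag.
    """
    mask = (1 << 40) - 1
    acc &= mask
    signed = acc - (1 << 40) if acc >> 39 else acc
    highest, lowest = (1 << 31) - 1, -(1 << 31)
    overflow = signed > highest or signed < lowest
    clamped = max(lowest, min(highest, signed))
    return clamped & mask, overflow


print(hex(sat40_to_32(0x0312345678)[0]))  # positive value outside 32 bits
print(hex(sat40_to_32(0xF312345678)[0]))  # negative value outside 32 bits
```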

sub
Type of instruction
Arithmetic instruction - subtraction.

Syntax
Register direct without carry: sub GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy - GRPx → GRPy

Description
The value in register GRPx is subtracted from the value in register GRPy and the
result is stored in register GRPy. The flags N, Z, C and O are updated. C is set
when unsigned subtraction does not generate borrow. O is set when signed sub-
traction generates overflow.

Example
sub GRPx GRPy

Register/Memory Before After


Status Reg b0000 b0010
GRPx h0012 h0012
GRPy h0020 h000e

subc
Type of instruction
Arithmetic instruction - subtraction with carry.

Syntax
Register direct with carry: subc GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPy - GRPx - 1 + C → GRPy

Description
The value in register GRPx is subtracted from the value in register GRPy. If C is
not set (for example if the previous instruction was a subtraction that generated
borrow) one more is subtracted. The result is stored in register GRPy. The flags
N, Z, C and O are updated. C is set if borrow does not occur.

Example
subc GRPx GRPy

Register/Memory Before After


Status Reg b0000 b0010
GRPx h0012 h0012
GRPy h0020 h000d
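
Because C is set when no borrow occurs, a subtraction wider than 16 bits can be chained word by word. The Python sketch below illustrates this; the helper names subc16 and sub32 are hypothetical, only the C flag is modelled, and starting the chain with C = 1 is used here as a stand-in for a plain sub on the low word.

```python
def subc16(y: int, x: int, carry: int):
    """Model of subc: y - x - 1 + C on 16 bits. Returns (result, carry_out)."""
    diff = (y & 0xFFFF) - (x & 0xFFFF) - 1 + (carry & 1)
    result = diff & 0xFFFF
    carry_out = int(diff >= 0)   # C is set if borrow does not occur
    return result, carry_out


def sub32(y: int, x: int) -> int:
    """32-bit subtraction built from two 16-bit steps."""
    low, carry = subc16(y & 0xFFFF, x & 0xFFFF, 1)    # C = 1: nothing extra subtracted
    high, _ = subc16(y >> 16, x >> 16, carry)         # propagate the borrow
    return (high << 16) | low


print(subc16(0x0020, 0x0012, 0))   # values from the subc example
print(hex(sub32(0x00010000, 0x00000001)))
```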

subi
Type of instruction
Arithmetic instruction - subtraction with immediate data

Syntax
Immediate data without carry: subi #Data, GRPy

Operands
h0 ≤ Data ≤ hFFFF

GRPy: GRP0 - GRP31

Execution
GRPy - Data → GRPy

Description
The value Data is subtracted from the value in register GRPy and the result is
stored in register GRPy. The flags N, Z, C and O are updated. C is set when
unsigned subtraction does not generate borrow. O is set when signed subtraction
generates overflow.

Example
subi #h5 GRPy

Register/Memory Before After


GRPy h0020 h001b

xor
Type of instruction
Logic instruction - bitwise xor.

Syntax
Register direct: xor GRPx GRPy

Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31

Execution
GRPx xor GRPy → GRPy

Description
Bitwise xor between the values in register GRPx and GRPy. The result is stored
in register GRPy. The flags N and Z are updated.

Example
xor GRPx GRPy

Register/Memory Before After


Status Reg b1000 b0000
GRPx h8000 h8000
GRPy h8012 h0012

xori
Type of instruction
Logic instruction - bitwise xor with immediate data.

Syntax
Immediate data: xori #Data GRPy

Operands
h8000 ≤ Data ≤ h7FFF

GRPy: GRP0 - GRP31

Execution
Data xor GRPy → GRPy

Description
Bitwise xor between the value in register GRPy and the value Data. The result is
stored in register GRPy. The flags N and Z are updated.

Example
xori #h1111 GRPy

Register/Memory Before After


Status Reg b0000 b0000
GRPy h1234 h0325

Appendix B

Assembly code for FIR-filter

This is the assembly code for the FIR-filter program that was used for verification.
The lack of I/O instructions makes it a bit awkward.

* FIR-filter
* input is dm(0:m)
* output is dm(2000:2000+m)
* tap coefficients tm(0:n)
* ARP0 Tap
* ARP1 Input
* ARP2 First sample
* ARP3 Output

load #9 CONTROL * fractional mode and modulo addressing for ARP0

load #0 ARP0 ** Initialize modulo addressing **

load #0 TOP0
load #31 BOTTOM0 * = number of taps
load #1 STEP0

load #1 STEP1
load #0 ARP2 * input start address
load #1 STEP2
load #2000 ARP3 * output start address
load #1 STEP3

load #1189 LOOP * = number of samples-number of taps = 1220-31


loop #19
move ARP2 ARP1 ** LOOP START **
clracc ACC0
loadtm ARP0++ GRP20 , loaddm ARP1++ GRP21
repeat #31 * = number of taps
mac GRP20 GRP21 ACC0 , loadtm ARP0++ GRP20 , loaddm ARP1++ GRP21
addi #1 ARP2
rnd ACC0 SAT * rounding and saturation
move2dm ARP3++ GRP27 ** LOOP END ** save output
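
As a cross-check for a program like the one above, a reference model in a high-level language is useful. The following Python sketch computes the same sliding sum of products as the loop body appears to (each output is the sum of tap i times the sample at offset i from the current first sample). It is a hedged illustration: the function name fir_reference is hypothetical, plain integers are used instead of the fractional mode, rounding and saturation are left out, and the number of outputs follows the assembly comment (number of samples minus number of taps).

```python
def fir_reference(samples, taps):
    """Sliding sum of products: out[k] = sum(taps[i] * samples[k + i])."""
    n = len(taps)
    outputs = []
    for first in range(len(samples) - n):      # one output per loop iteration
        acc = 0                                # corresponds to clracc ACC0
        for i in range(n):                     # corresponds to the repeated mac
            acc += taps[i] * samples[first + i]
        outputs.append(acc)
    return outputs


print(fir_reference([1, 2, 3, 4, 5], [1, 1]))
```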

Appendix C

Pipeline Timing Analysis

In order to find potential pipeline conflicts, many special program flow cases were
studied in detail and pipeline timing diagrams were made. Here, a few simple
cases are shown to illustrate how delayed jumps and hardware loops work.

0: braeq #10
1: instr1
2: instr2
3: instr3
10: instr10

PC:      0       1       2       3/10
fetch:   braeq   instr1  instr2  instr3/10
decode:          braeq   instr1  instr2  instr3/10
execute:                 braeq   instr1  instr2  instr3/10

Figure C.1: Delayed branch. The two instructions following a jump or branch are
always executed, whether the jump is taken or not.
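
The fetch order implied by the delayed branch can be written down as a short Python sketch. The function name delayed_branch_trace is hypothetical, and the number of delay slots is a parameter set to two to match the description.

```python
def delayed_branch_trace(branch_pc: int, target: int, taken: bool, delay_slots: int = 2):
    """PC values from the branch up to the first instruction after the delay slots."""
    # the branch itself followed by the instructions in the delay slots
    trace = [branch_pc + i for i in range(delay_slots + 1)]
    # only now does the outcome of the branch change the program flow
    trace.append(target if taken else branch_pc + delay_slots + 1)
    return trace


print(delayed_branch_trace(0, 10, taken=True))
print(delayed_branch_trace(0, 10, taken=False))
```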

0: repeat #3
1: instr1
2: instr2
3: instr3
4: instr4

PC:         0       1       2       3       3       3       4
fetch:      repeat  instr1  instr2  instr3  instr3  instr3  instr4
decode:             repeat  instr1  instr2  instr2  instr2  instr3
execute:                    repeat  instr1  instr1  instr1  instr2
repeat reg: 1       1       1       3       2       1       1

Figure C.2: The repeat instruction. When the repeat instruction is executed, its
argument is copied to the repeat register. As long as the value in the repeat register
is greater than one, the PC, the instruction register and the control registers are not
updated.

0: loop #4
1: instr1
2: instr2
3: instr3
4: instr4

PC:      0      1      2      3      4      1      2      3      4      1
fetch:   loop   instr1 instr2 instr3 instr4 instr1 instr2 instr3 instr4 instr1
decode:         loop   instr1 instr2 instr3 instr4 instr1 instr2 instr3 instr4
execute:               loop   instr1 instr2 instr3 instr4 instr1 instr2 instr3
LOOP:    5      5      5      5      5      5      5      4      4      4
top of loop stack
start:                        1      1      1      1      1      1      1
end:                          4      4      4      4      4      4      4
counter:                      5      5      4      4      4      4      3

Figure C.3: The loop instruction. Before executing the loop instruction the num-
ber of loops has to be loaded to the LOOP register. When the loop instruction is
executed, loop start, loop end and number of loops are pushed to the loop stack.
When PC is equal to the loop end value, the loop start value is copied to the
PC. There is a two cycle delay before the loop counter value is copied back to the
LOOP register. In that way LOOP is updated the same cycle as the first instruction
in the loop is executed.
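
The program counter sequence produced by a single hardware loop can be sketched in Python as below. This is a simplified illustration: the name hardware_loop_trace is hypothetical, only one loop level is modelled (no loop stack), and the two cycle delay of the LOOP register update is not represented.

```python
def hardware_loop_trace(loop_pc: int, end_addr: int, count: int):
    """PC values for one hardware loop, up to the first instruction after it."""
    start = loop_pc + 1                 # loop start is the address after the loop instruction
    trace = [loop_pc]
    for _ in range(count):
        # run from loop start to loop end, then jump back while iterations remain
        trace.extend(range(start, end_addr + 1))
    trace.append(end_addr + 1)          # fall through after the last iteration
    return trace


print(hardware_loop_trace(0, 4, 5)[:10])   # first ten PC values for loop #4 with LOOP = 5
```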

In English

The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring excep-
tional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose. Sub-
sequent transfers of copyright cannot revoke this permission. All other uses of
the document are conditional on the consent of the copyright owner. The pub-
lisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be men-
tioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity, please
refer to its WWW home page: http://www.ep.liu.se/

© Eric Tell
Division, Department: Institutionen för Systemteknik, 581 83 LINKÖPING
Date: 2000-12-17
Language: English
Report category: Examensarbete
ISRN: LITH-ISY-EX-3209-2001
URL for electronic version: http://www.ep.liu.se/exjobb/isy/2001/3209/

Title (Swedish): En domänspecifik DSP-processor
Title: A Domain Specific DSP Processor
Author: Eric Tell

Abstract
This thesis describes the design of a domain specific DSP processor.

The thesis is divided into two parts. The first part gives some theoretical background, describes the
different steps of the design process (both for DSP processors in general and for this project) and
motivates the design decisions made for this processor.

The second part is a nearly complete design specification.

The intended use of the processor is as a platform for hardware acceleration units. Support for this
has however not yet been implemented.

Keywords: DSP processor design, CPU design
