DSP Processor
Eric Tell
             Linköping 2001
                                     Abstract
    The thesis is divided into two parts. The first part gives some theoretical back-
ground, describes the different steps of the design process (both for DSP processor
design in general and for this project) and motivates the design decisions made for
this processor.
    The second part is a nearly complete design specification.
    The intended use of the processor is as a platform for hardware acceleration
units. Support for this has however not yet been implemented.
Contents
5 Machine Code Design
   5.1 Orthogonality
   5.2 The Instruction Word of This Processor
8 Benchmarking
   8.1 MIPS and MACS
   8.2 Application Benchmarking
   8.3 Algorithm Kernel Benchmarking
   8.4 Tools for Benchmarking
   8.5 Benchmarks for This Processor
10 RTL Implementation
   10.1 Micro Architecture
   10.2 VHDL Implementation
11 Verification
   11.1 The Verification Strategy
   11.2 Verification for This Project
        11.2.1 Block Level Verification
        11.2.2 Instruction Level Verification
        11.2.3 Random Testing
        11.2.4 Application Level Verification
II Design Specification
13 Introduction
   13.1 Processor Features
   13.2 Outline of This Part of the Thesis
14 Data Path
   14.1 Architecture Overview
   14.2 Arithmetic Unit
   14.3 Shift Unit
   14.4 Logic Unit
   14.5 MAC Unit
        14.5.1 Rounding
        14.5.2 Saturation Unit
   14.6 Register File
        14.6.1 The Accumulator Registers
        14.6.2 The Control Register
   14.7 Addressing
        14.7.1 Addressing Modes
        14.7.2 Modulo Addressing
        14.7.3 Bit-Reversed Addressing
   14.8 The Status Register
15 Control Path
   15.1 The Pipeline
   15.2 Instruction Decoder
   15.3 Program Counter
   15.4 Program Flow Controller
        15.4.1 Subroutine Calls - The PC Stack
        15.4.2 Hardware Looping
   15.5 Pipeline Controller
   15.6 Branch Controller
16 Instruction Set
   16.1 The Instruction Word
   16.2 Parallel Memory Instructions
        16.2.1 Move to Memory:
        16.2.2 MAC Operation and Load
        16.2.3 Arithmetic, Logic, Shift or Move Operation and Load
   16.3 Instruction Set Restrictions
        16.3.1 Branch and Jump Instructions
        16.3.2 Hardware Loops
        16.3.3 Modulo Addressing
   16.4 Instruction Encoding
            Part I
Chapter 1
Introduction
this project. The second part is a more or less complete design specification for
the processor.
    The first part has the following contents:
Chapter 2 describes the special features of DSP processors that separate them
    from general purpose processors.
Chapter 2
This chapter describes some important differences between DSP processors and
general purpose processors. It also explains some concepts that are used later in
the report.
2.1 Architecture
The most important difference between the architecture of DSP processors and
general purpose processors is probably the possibilities for multiple memory ac-
cesses in one clock cycle. Generally a DSP processor has separate program and
data memories. This allows the processor to fetch an instruction, while simulta-
neously fetching operands or storing results for a previous instruction. Often it
is also possible to fetch multiple data from memory in one clock cycle by using
multiple busses and multi port memories or multiple independent data memories.
possible to accumulate a number of values without the risk of overflow. If n guard
bits are used, $2^n$ values can be accumulated without the possibility of overflow.
Most DSP processors have four or eight guard bits.
    The MAC hardware usually also supports saturation (see 2.3 below) and rounding,
to get a result of the native data width.
Example: If the native data width is n, then the result of the multiplication will
have $2n$ bits. So with m guard bits the result of the MAC operation will have
$2n + m$ bits. This value can be saturated to a $2n$-bit value and then rounded to get
a value of the native data width n, that can be stored in memory or used in other
kinds of operations.
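
The example above can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the hardware of this processor: the function names, the default widths (n = 16, m = 8) and the rounding rule (add half an LSB, then truncate) are chosen for illustration.

```python
def wrap(value, bits):
    """Two's complement wrap-around to `bits` bits, as a plain adder would do."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def saturate(value, bits):
    """Clamp a signed value to the range of a `bits`-bit two's complement word."""
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))


def mac(samples, coeffs, n=16, m=8):
    """Multiply-accumulate n-bit values in a (2n + m)-bit accumulator.

    Assumed post-processing: saturate to 2n bits, round to n bits by adding
    half an LSB and truncating, then saturate once more because the rounding
    itself can overflow by one.
    """
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = wrap(acc + x * c, 2 * n + m)   # each product needs 2n bits
    saturated = saturate(acc, 2 * n)          # (2n + m) bits -> 2n bits
    rounded = (saturated + (1 << (n - 1))) >> n
    return saturate(rounded, n)               # native data width n
```

With m = 8 guard bits, up to 256 full-scale products can be accumulated before the wrap in the loop can take effect; the final saturation then limits the result instead of letting it change sign.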
Figure 2.1: The difference between saturation and normal arithmetic: a) normal
arithmetic hardware, b) saturation arithmetic. The X-axis is the “real” result and
the Y-axis is the output from hardware.
110 = 6           011 = 3
111 = 7           111 = 7
   Because the FFT algorithm is so common, many DSP processors have hard-
ware support for bit-reversed addressing.
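
As an illustration, bit-reversed indices such as those in the table above can be computed in software as follows. This is a minimal sketch; the function name is hypothetical and says nothing about how the address generator of this processor is built.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current LSB into the result
        index >>= 1
    return result


# Access order for an 8-point FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

Hardware support removes the cost of this loop, since the reversed address is produced as part of the addressing mode.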
Chapter 3
This chapter gives an overview of the design flow for a DSP Processor. Figure 3.1
illustrates the flow and a short description of every step follows.
[Figure 3.1 (design flow): Requirement analysis, Behavioral model, Benchmarking,
Architecture design, RTL implementation, Verification]
3.4 Benchmarking
Benchmarking is used to verify that the instruction set offers sufficient perfor-
mance to fulfill the requirements set up during requirement analysis. If it does
not, the instruction set has to be modified. Typically performance is increased
by moving tasks from software to hardware. After a few iterations, hopefully a
working instruction set can be released. After this is done the software engineers
can start their work concurrently with the hardware development.
    Chapter 8 describes benchmarking further.
    Chapter 10 says a little more about architecture design and RTL implementa-
tion.
3.7 Verification
Verification is a very large and very important part of the design process. Although
it is the final step before the implementation is released it is important to have a
good verification strategy from the beginning and to keep it in mind during every
step of the flow.
     The verification can be divided into functional verification, where for example
the logical correctness of HDL code is verified, and physical verification, which
means verifying for example timing constraints. To perform physical verification,
the HDL code (or at least parts of it) obviously has to be synthesized first.
     In this project no synthesis, and thus no physical verification, has been
performed yet.
    If errors are found during verification, one has to go back to the RTL
implementation or architecture design to make corrections. When the verification
result is satisfactory (it is impossible to test everything) the RTL implementation
is released.
    The functional verification process is described more thoroughly in chapter
11.
Chapter 4
   5. Use a multiple bus architecture that allows multiple memory accesses in one
      clock cycle.
   7. Provide support for fast hardware looping.
     Most of the addressing modes common in DSP processors are also supported.
Furthermore, the instruction set allows memory accesses in parallel with
computational operations (execute one operation and simultaneously load operands
for the next one from memory). This can significantly improve the performance of
many algorithms (for example convolution based algorithms like FFT and FIR/IIR
filters).
     See chapter 16 and appendix A for a complete description of the instruction
set.
Chapter 5
This chapter discusses how to choose the instruction encoding and explains the
choice of instruction word for this processor.
5.1 Orthogonality
One way to measure how “good” the instruction set of a processor is, is the
concept of orthogonality. On the instruction set level, orthogonality refers to the
completeness and consistency of the instruction set and to the degree to which
different addressing modes are uniformly available with different operations [2].
For example, a processor that has an add function but not a subtract function, or
where the subtract function supports different addressing modes than the add
function, would be considered nonorthogonal.
    On the machine code level, orthogonality relies on the principle of dividing the
instructions into different groups of instructions that work similarly. The machine
code can then be multiplexed, except for the multiplex control field of the binary
code that chooses which group the instruction belongs to. This significantly
simplifies the instruction decoding, since most control signals can be decoded from
only a small number of the instruction word bits.
    Figure 5.1 shows an example of an orthogonal instruction word.
    Another way of increasing orthogonality and simplifying decoding is to divide
the instruction word into subfields that as far as possible always have the
same function. For example, the bits selecting the instruction and those selecting
operands should be separated in the instruction word, and the source and
destination registers should always be selected by the same bits.
Figure 5.1: An instruction word and its subfields, from bit n down to bit 0: Mux
(bits selecting instruction group: arithmetic, logic, mac, etc), Op, Source (bits
selecting source register) and Dest. An instruction set where all instructions used
this format would be considered highly orthogonal.
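
Decoding such a word amounts to slicing out fixed bit ranges. The sketch below uses the field order of Figure 5.1; the field widths (3, 5, 4 and 4 bits) are hypothetical, chosen only to keep the example small, and are not the widths used by this processor.

```python
# Field names from Figure 5.1, most significant field first.
# The widths are assumptions made for this example only.
FIELDS = [("mux", 3), ("op", 5), ("source", 4), ("dest", 4)]


def encode(**values):
    """Pack the named fields into one instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Split an instruction word into its fields, starting from bit 0."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

Because every field always sits at the same bit positions, each one can be extracted without first looking at the others, which is the property that keeps the decoder simple.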
onality and lots of space for new instructions and future improvements. Some
instructions, particularly those using 16-bit immediate data, could hardly have
been implemented at all using a 24-bit instruction set.
    Furthermore since this processor is partly for demonstration purposes the sim-
plicity provided by a highly orthogonal instruction set was even more preferable.
The instruction word is further described in section 16.1.
Chapter 6
This chapter describes the design of the top-level architecture: computational
units, busses, memories and registers.
connected together - tri-state buffers are not used.
6.3 Concurrent Design of Instruction Set and Architecture
In practice instruction set design, machine code design and architecture planning
are to a large extent done concurrently. When you decide to add an instruction
to the instruction set you also have to consider how it could be implemented in
hardware and if there is “enough space” for it in the instruction word. Otherwise
you will surely run into trouble at the later steps.
    Furthermore, as the architecture and machine code “evolve” you can often
find new instructions that can be implemented with very little extra cost (in terms
of hardware, instruction word length or loss of orthogonality).
    So the development is in fact more of an iterative process - instruction set and
architecture are built up concurrently step by step.
Chapter 7
This chapter describes the instruction set simulator (ISS), what it is, why there is
one and how it works.
7.1 What?
The instruction set simulator is just what the name says - a program that simulates
the function of all the instructions of the processor.
    The ISS simply loads a binary file generated by the assembler, transforms
it back to assembly language instructions and runs it instruction by instruction,
generating exactly the same result as the actual processor would. It also has
features for debugging, saving simulation results to file and more.
7.2 Why?
The ISS is very important in the design flow and is used to some extent in almost
every step.
7.2.2 A Behavioral Model
The ISS is used to verify the behavior of the processor, that is to verify that it
really does exactly what it is intended to do, that it can really run all the kinds
of applications it is supposed to and, last but not least, that it can do so
with sufficient performance (measured in number of useful instructions per clock
cycle or something similar; see also chapter 8). For these reasons a bit-true and
cycle-true ISS is needed. In other words, it has to both produce exactly the right
results on instruction level and keep track of exactly how many clock cycles will
be used. (It is not as simple as just one instruction per clock cycle, especially with
a more complex pipeline.)
7.2.3 Verification
Maybe the most important use of the ISS is for verification of the hardware.
Beginning at the instruction level, basically all verification of the hardware is done
by comparing the test results from the hardware with those generated by the ISS
behavioral model.
7.3 How?
This section describes the ISS developed for this project.
7.3.1 Features
Apart from disassembling and running the program, either the whole program or
one instruction at a time, and showing the contents of registers, the ISS has the
following features:
Modifying Registers and Memory
Contents of general purpose registers, program counter and memory can be altered
manually.
Breakpoints
Breakpoints can be entered causing execution of the program to halt at a specified
line of code.
Script files
All functions available within the simulator can be executed from a script file, that
can be run either from within the simulator or automatically at startup.
Batch Mode
The simulator has a special batch mode for use in for example shell scripts. In
batch mode the simulator automatically starts, loads a program, runs a script file
and quits. (The script would typically load input data from file, run the program
and save the output to another file.)
   1 The term tap comes from digital filtering: a filter is divided into taps, each consisting of a
MAC operation where data is multiplied with a coefficient, so a tap memory is typically a data
memory holding (filter) coefficients.
7.3.2 Implementation
The ISS was implemented using C++. The code is divided into different files
so that everything that is dependent on the processor architecture is separated
from things related only to how the simulator works. Figures 7.1 and 7.2 show
flow charts of the most important functions of the simulator, namely loading and
running a program and executing an instruction.
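
The run-and-step behavior of such a simulator can be sketched in a few lines. This is not the C++ code of the ISS: the three-instruction toy instruction set, the register names and the one-cycle-per-instruction clock are assumptions made for illustration, so the sketch is neither bit-true nor cycle-true.

```python
def run(program, regs, pc=0, steps=0):
    """Execute `program` (a list of (op, dst, src) tuples) starting at `pc`.

    steps=0 runs until the end of the program; otherwise at most `steps`
    instructions are executed. Returns the new pc and the clock count.
    """
    clock = 0
    while pc < len(program):
        op, dst, src = program[pc]
        if op == "set":            # load an immediate value
            regs[dst] = src
        elif op == "add":
            regs[dst] += regs[src]
        elif op == "mul":
            regs[dst] *= regs[src]
        else:
            raise ValueError(f"unknown instruction {op!r}")
        pc += 1
        clock += 1                 # simplification: one cycle per instruction
        if steps == 1:             # last requested step has been executed
            break
        if steps > 1:
            steps -= 1
    return pc, clock
```

Calling `run(program, regs, steps=1)` repeatedly, passing the returned pc back in, corresponds to single-stepping; `steps=0` corresponds to running the whole program.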
[Figure 7.1: Flow charts for loading and running a program.
Loading: get filename; if the file exists, open it; 1: read line, 2: interpret line;
reset (pc=0, clock=0); display status; wait for user command.
Running (’run’ command issued): execute instruction at address pc; check the
instruction on the new pc address to see if a ’nop’ should be inserted; test
pc>program size, step=0 and step=1; otherwise step=step−1; finally display status
and wait for user command.]
[Figure 7.2: Flow chart for executing an instruction. After the operands are fetched
(a warning stops execution after completing the instruction), the action depends on
the kind of instruction:
  pop pcstack to jmpaddr; jumpdelay=3
  subroutine call: push pc to pc stack; jmpaddr = call address; jumpdelay=3
  jump or branch: if jump taken, jumpdelay=3 and jmpaddr = jump address; else do nothing
  loop instruction: push to loop stack: loopstart = pc+1, loopend = operand,
    loopcounter = loopreg
  repeat instruction: repeatreg = operand
  other instruction: 1: make calculations, 2: save result, 3: update status flags
The next pc is then chosen from the tests jmpdelay>1, jmpdelay=1 and jmpdelay=0,
repeatreg>1 (repeat instruction, repeatreg = repeatreg−1) and pc=loopend (end of
loop); in normal execution pc=pc+1.]
Chapter 8
Benchmarking
can also perform other operations in parallel with MAC operations.
pany that, among other things, publishes impartial technical evaluations of DSP
processors).
Function              Description                        Example Application
Real Block FIR        Finite impulse response filter     Speech processing (e.g.
                      that operates on a block of        G.728 speech encoding).
                      real (not complex) data.
Complex Block FIR     FIR filter that operates on a      Modem channel equalization.
                      block of complex data.
Vector Dot Product    Sum of the pointwise               Convolution, correlation,
                      multiplication of two vectors.     matrix multiplication,
                                                         multi-dimensional signal
                                                         processing.
Viterbi Decoder       Decode a block of bits that has    Error control coding.
                      been convolutionally encoded.
The control path of a processor has three necessary parts. The first is the program
memory or control memory, where all the instructions of the program are stored.
The second is the program flow controller that generates the program counter (PC)
address, that points out the next instruction to be fetched from program memory.
Finally the instruction decoder decodes the control signals (both to control path
and data path) from the instruction word.
    Usually there is also a PC stack for saving return addresses for subroutine
calls, hardware for supporting hardware looping, interrupt handling and many
other things (though many of these might be considered to be part of the program
flow controller).
    This processor has a PC stack for subroutine calls, a loop stack for supporting
nested hardware loops, a repeat register for simple repeating of one instruction and
a pipeline controller whose purpose is described in 9.1 below. Interrupt handling
is not yet implemented.
pipeline steps and execute all steps in parallel. This could mean for example
that in the same clock cycle as one instruction is fetched from memory, another
instruction is decoded, and yet another is executed by a computational unit. In
this way the performance of the processor is increased.
     DSP processors usually use three or four pipeline steps, but other solutions
also exist. A longer pipeline allows the processor to execute faster, but program-
ming usually becomes a bit more complicated and branching effects (see 9.2) and
similar complications have greater impact.
     This processor has a variable pipeline depth. Most instructions are executed
in three steps (fetch, decode and execute), but due to the long critical path of the
multiplication unit, the execution part of the multiply and mac instructions is
pipelined into two steps, giving a total of four pipeline steps for these instructions.
     This might sound like a complicated solution, but as it turned out it could be
handled with little extra hardware. Conflicts can occur when a four step instruction
is followed by a three step instruction that uses some of the same resources in
its third step as the four step instruction uses in its last step, but this is handled
without great difficulty: the pipeline control unit monitors what kind of instruction
is currently executing, what the next instruction is and what resources these
instructions use. It then halts the pipeline for one clock cycle (by inserting a nop
instruction) when this is needed to avoid conflicts. An example of this is shown
in figure 9.1. In most cases it is possible to avoid these extra clock cycles by
rearranging the program code (so that a four step instruction is never followed
directly by a three step instruction that uses the same resources).
Note: The organization of the register file allows one MAC unit instruction and
one other instruction to write to it in the same clock cycle, as long as they don’t
use exactly the same register.
 fetch:         mac     add1    add2                 fetch:         mac     add1     add2   add2
 decode:                mac     add1   add2          decode:                mac      add1   add1   add2
 execute:                       mac    add1   add2   execute:                        mac    nop    add1   add2
 mac step 2:                           mac           mac step 2:                            mac
               a) No conflict                                      b) Nop inserted
Figure 9.1: A mac instruction (four pipeline steps) is followed by two add
instructions (three pipeline steps). In a) there are no problems. In b) an operand
of the first add is part of the result from the mac instruction, so a nop is inserted
by the processor to avoid an error.
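
The decision shown in Figure 9.1 can be modelled in software. The following is a sketch under a simplifying assumption: the only conflict considered is case b), where a three step instruction directly follows a four step instruction and reads its destination. The tuple format and register names are hypothetical, and the real pipeline controller monitors resources more generally.

```python
FOUR_STEP = {"mul", "mac"}  # instructions whose execution is split into two steps


def insert_nops(program):
    """Return the issue order of `program` with nops inserted where needed.

    program: list of (opcode, dest, src1, src2) tuples.
    """
    issued = []
    previous = None
    for instr in program:
        opcode, dest, *sources = instr
        if (previous is not None
                and previous[0] in FOUR_STEP       # result not ready until step four
                and opcode not in FOUR_STEP        # next instruction executes in step three
                and previous[1] in sources):       # and it reads that result
            issued.append(("nop",))
        issued.append(instr)
        previous = instr
    return issued
```

Swapping two independent instructions so that the dependent add no longer follows the mac directly removes the nop, which is the rearrangement of program code mentioned above.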
with nop operations. This means that every jump consumes one extra clock cycle
for every pipeline step before the execution step. The other solution is to use
delayed jumps. This means that the instructions that are already in the pipeline
are also executed. To the programmer it looks as if the jump is delayed by a
number of (typically two) instructions. This tends to make the program a bit more
difficult to follow and the possibility of having two jump instructions immediately
following each other has to be handled somehow.
    This processor uses delayed jumps (for both conditional and unconditional
jumps, subroutine calls and return from subroutine instructions). Furthermore
an instruction that may cause a jump must always be followed by two non-jump
instructions.
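
The programmer-visible effect of a delayed jump can be sketched as follows. The variable names jmpdelay and jmpaddr and the initial delay value of 3 are taken from the ISS flow chart in Figure 7.2; the program format and the function name are hypothetical, and the sketch does not check the restriction that a jump must be followed by two non-jump instructions.

```python
def trace(program, jump_delay=3):
    """Return the labels of the executed instructions, in order.

    program: list of ("jmp", target_address) or ("op", label) tuples.
    """
    executed = []
    pc = 0
    jmpdelay = 0
    jmpaddr = 0
    while pc < len(program):
        kind, arg = program[pc]
        if kind == "jmp":
            executed.append("jmp")
            jmpdelay, jmpaddr = jump_delay, arg
        else:
            executed.append(arg)
        if jmpdelay > 1:        # still in the delay slots: continue sequentially
            jmpdelay -= 1
            pc += 1
        elif jmpdelay == 1:     # delay slots used up: the jump takes effect
            jmpdelay = 0
            pc = jmpaddr
        else:                   # no jump pending
            pc += 1
    return executed


# The two instructions after the jump ("a" and "b") are executed, "c" is skipped.
order = trace([("jmp", 4), ("op", "a"), ("op", "b"), ("op", "c"), ("op", "d")])
```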
stored.
    See section 15.4.2 for further information on hardware looping.
    The complete control path is described in chapter 15.
    Section 16.3 discusses restrictions to the use of some instructions due to pipeline
complications.
Chapter 10
RTL Implementation
VHDL code.
    Synthesizable VHDL code was generated for the whole processor core except
the memories, for which simple behavioral models were used.
Chapter 11
Verification
Compliance Testing
Verifying that the design or part of the design follows its specification.
Corner Testing
Trying to find and test the most complex scenarios that are most likely to cause
errors.
Random Testing
Since it is usually impossible to find all corner cases, it can be useful to use a
setup that generates and tests random test vectors. This often generates strange
unanticipated corner cases.
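
A random test setup can be sketched as a loop that feeds the same random vectors to the unit under test and to a reference model and records every disagreement. In the sketch below both models are ordinary Python functions chosen for illustration (a saturating 16-bit adder as reference and a deliberately faulty wrap-around adder as unit under test); in a real setup the reference would be the ISS and the unit under test the simulated hardware.

```python
import random


def sat_add16(a, b):
    """Reference model: 16-bit signed addition with saturation."""
    return max(-32768, min(32767, a + b))


def wrap_add16(a, b):
    """Deliberately faulty model: 16-bit addition that wraps around on overflow."""
    return ((a + b + 32768) & 0xFFFF) - 32768


def random_test(dut, reference, trials=1000, seed=0):
    """Return the random input pairs for which `dut` and `reference` disagree."""
    rng = random.Random(seed)   # fixed seed so that a failing run can be repeated
    mismatches = []
    for _ in range(trials):
        a = rng.randint(-32768, 32767)
        b = rng.randint(-32768, 32767)
        if dut(a, b) != reference(a, b):
            mismatches.append((a, b))
    return mismatches
```

Every pair returned by `random_test(wrap_add16, sat_add16)` is an overflow case, that is, a corner case found without having been written down in advance.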
Path Coverage
Path coverage is a measure of how many of all possible interconnections between
different components are tested. Normally a path coverage of 100% is required.
Branch Coverage
This is a measure of how many of all possible combinations of multiplexer inputs
are tested. Usually a branch coverage of 100% is needed at least at the lowest
block level.
    Program flow instructions were also tested rather extensively on instruction
level. The exception is the ’loop’ instruction, which has many special
cases that might cause problems. Not all of these were tested to the extent they
should have been, but most of them were tested quite thoroughly during the
block level testing of the program flow controller and the PC, loop and repeat stacks.
[Three panels plotted over the range 0 to 1000: Input; Output from Matlab (64-bit floating point); Output from DSP (16-bit fixed point).]
Figure 11.1: FIR filter outputs from Matlab and from the DSP processor.
   The program turned out to work very well. The difference from the result of
a Matlab implementation using 64-bit floating point representation was of the
same order as the precision possible with 16-bit fractional numbers ($2^{-15}$, or
approximately three units in the fifth decimal).
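
The size of one 16-bit fractional step can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `to_q15` and the round-to-nearest conversion are assumptions and are not taken from the test program used in the project.

```python
# Weight of the least significant bit of a 16-bit fractional number.
Q15_STEP = 2.0 ** -15

def to_q15(x):
    """Quantize x in [-1, 1) to a 16-bit fractional value (assumed round-to-nearest)."""
    n = int(round(x / Q15_STEP))
    n = max(-32768, min(32767, n))  # clamp to the signed 16-bit range
    return n * Q15_STEP

# Quantization error for one sample value; it cannot exceed half a step.
error = abs(to_q15(0.3) - 0.3)
```

With this conversion the error of a single quantized value is at most half of `Q15_STEP`, which is the order of magnitude referred to above.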
Chapter 12
This chapter summarizes results and conclusions from the project and presents
ideas for changes and future improvements of the processor. Many of these things
have already been mentioned in the previous chapters.
a separate register.
Shorter Instructions
As mentioned before, there is a lot of unused space in the instruction word and
some bits are hardly used at all. Even with half of the instruction space reserved
for accelerator instructions, the instruction word could easily have been made at
least two bits shorter. However, if standardized memories were to be used, and the
instruction word length therefore had to be the traditional “multiple of eight”, 24
bits would be the next smaller step, and that would hardly have been achievable
without further limitations to the instruction set.
12.3.1 Interrupts
Although not yet implemented for this processor, interrupt handling is necessary
to communicate efficiently with other hardware. Basically all DSP processors
handle interrupts, although the way in which it is done is often a bit simpler (and
quicker) than for general purpose processors.
   Specifically for this processor, the support for hardware accelerators would
probably include some sort of interrupt.
      Part II
Design Specification
Chapter 13
Introduction
This second part of the thesis describes the architecture and instruction set of the
processor. It is not a complete specification, but should at least be enough for the
user of the processor.
13.2 Outline of This Part of the Thesis
This part of the thesis has the following chapters:
Chapter 14 Describes the architecture of the data path, the computational units,
    registers and addressing.
Chapter 15 Gives an overview of how the control part of the processor architec-
    ture works.
Chapter 16 Describes the instruction word and its subfields and lists the machine
    code of all instructions. This chapter also contains information on some
    restrictions that apply to the use of some instructions (mainly program
    flow instructions).
Chapter 14
Data Path
[Figure: overview of the data path. Labeled blocks: Addr gen, Reg, TM, DM, Arith, Shift, Logic and MAC; 16-bit busses AA, AB, DA and DB; register file ports RA and RB; 40-bit paths.]
     At the most one 40-bit word and two 16-bit words can be written to the register
file in one clock cycle.
     The architecture also has two address busses AA and AB so two memories
can be addressed simultaneously.
DA or is immediate data from the instruction word. The second operand (if there
is one) is always from DB. Addition and subtraction are done with or without
saturation depending on the saturation mode control bit.
    The arithmetic unit can be seen in figure 14.2.
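
The difference between the two modes can be illustrated with a small Python model of a 16-bit addition. This is a sketch of the general technique only: the function name `add16` and the `saturate` argument (standing in for the saturation mode control bit) are assumptions and do not describe the actual hardware implementation.

```python
INT16_MIN, INT16_MAX = -0x8000, 0x7FFF

def add16(a, b, saturate):
    """Add two signed 16-bit values.

    With saturate set, the result is clamped to the 16-bit range.
    Otherwise it wraps around as in ordinary two's complement arithmetic.
    """
    total = a + b
    if saturate:
        return max(INT16_MIN, min(INT16_MAX, total))
    # Wrap to 16 bits and reinterpret as a signed value.
    return ((total + 0x8000) & 0xFFFF) - 0x8000
```

For example, adding h7000 and h2000 gives h7FFF with saturation, while without saturation the sum wraps around to a negative value.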
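[Figure 14.2: the arithmetic unit. Labels: DA, DB, immediate operand, carry flag C, DA[15], an Add/Sub block with carry input Cin and add/sub control.]
[Figure: the shift unit. Labels: DA, DB, immediate operand, carry flag C, Shift block.]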
or immediate data from the instruction word and the second operand is on DB.
The single operand for the ’not’ operation is always on DA.
    Figure 14.4 shows the logic unit.
    A 32-bit multiplier multiplying two 16-bit operands from DA and DB into a
32-bit result. Both operands can be taken as signed or unsigned values indepen-
dently. The result of the multiplication is stored in an internal pipeline register in
the MAC unit.
    A 40-bit adder where the registered result of the multiplication, sign extended
with eight guard bits to a total of 40 bits, can be added to or subtracted from
one of the 40-bit accumulator registers. A 16- or 32-bit value from DA, or DA
concatenated with DB, can also be added or subtracted directly to an accumulator
register. The adder also facilitates rounding (see below).
    A 32-bit barrel shifter which enables the value from the accumulator to be
arithmetically shifted before reaching the adder. The number of steps to shift is
either the six least significant bits of DA or 6-bit immediate data from the instruction
word (positive value for left shift and negative for right shift).
    The MAC unit is shown in figure 14.5.
14.5.1 Rounding
Rounding is executed by adding 1 to the 17th bit position (i.e. bit 16) of the
40-bit value if the 16 least significant bits are larger than h7FFF. This means that
the 24 most significant bits (16 bits plus 8 guard bits) of the result are the rounded
value and the 16 least significant bits are unaffected. (The equivalent operation
using decimal numbers would be to add one if the decimal part was greater than
or equal to 0.5 and then truncate the decimals.)
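
The rounding rule can be written out as a short Python sketch. The function names are hypothetical, the accumulator is represented as an unsigned 40-bit bit pattern, and wrapping the sum to 40 bits is an assumption (the interaction with saturation is not modelled here).

```python
MASK40 = (1 << 40) - 1

def round_acc(acc):
    """Round a 40-bit accumulator value given as an unsigned bit pattern."""
    if (acc & 0xFFFF) > 0x7FFF:           # low word is at least one half
        acc = (acc + (1 << 16)) & MASK40  # add 1 at bit 16 (assumed to wrap at 40 bits)
    return acc                            # the 16 low bits are left unchanged

def rounded_result(acc):
    """Return the 24 most significant bits (8 guard + 16 high) after rounding."""
    return round_acc(acc) >> 16
```

A low word of h8000 therefore rounds the upper part up by one, while a low word of h7FFF leaves it unchanged.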
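[Figure 14.5: The MAC unit. Labels: operands DA and DB (each U/Signed), Mult with a 32-bit product, int/frac, shift with step count from DA[5:0] or instr[10:5], add/sub, round, sat, and the 40-bit acc divided into guard (8 bits), high (16 bits) and low (16 bits).]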
      GRP0/ARP0
      GRP1/ARP1
      GRP2/ARP2
      GRP3/ARP3
      GRP4/ARP4
      GRP5/ARP5
      GRP6/ARP6
   GRP7/ARP7/LOOP
      GRP8/STEP0
      GRP9/STEP1
     GRP10/STEP2
     GRP11/STEP3
   GRP12/STEP4/TOP0
 GRP13/STEP5/BOTTOM0
   GRP14/STEP6/TOP1
 GRP15/STEP7/BOTTOM1
    GRP16/CONTROL
        GRP17
        GRP18
        GRP19
        GRP20
        GRP21
        GRP22
        GRP23
        GRP24
        GRP25
    GRP26/ACC0-low
    GRP27/ACC0-high
    GRP28/ACC0-guard
     GRP29/ACC1-low
     GRP30/ACC1-high
    GRP31/ACC1-guard
    GRP0 - GRP7 (ARP0 - ARP7) can also be used as address registers for ad-
dressing data and tap memory.
    GRP7 (LOOP) is also used to hold the loop counter value during hardware
loops.
    GRP8 - GRP15 (STEP0 - STEP7) hold step lengths for updating the address
registers during post increment addressing.
    GRP12 - GRP15 also hold the top (TOP0/TOP1) and bottom (BOTTOM0/BOTTOM1)
registers for modulo addressing.
    GRP16 (CONTROL) holds the control bits and GRP26 - GRP31 (ACC0/ACC1)
the accumulator registers.
    The architecture of the register file can be seen in figure 14.6.
Figure 14.6: The register file. inc0, inc1 etc. are the new values for post increment
addressing, that is $inc_x = ARP_x + STEP_x$. For inc0 and inc1 modulo updating
is also applied if modulo addressing is enabled, see figure 14.7. The block marked
modulo is used for offset modulo addressing. It is similar to the blocks calculating
inc0 and inc1. The block marked BR is for bit reversal.
Figure 14.7: Modulo updating of address registers. A circuit like this generates the
signals inc0 and inc1 in figure 14.6. The multiplexers controlling the calculation
of the delta value are controlled by the sign of the step value. If STEP is positive
then $delta = ARP + STEP - BOTTOM - 1$, and if STEP is negative
$delta = TOP - ARP - STEP - 1$. If $delta < 0$ then $ARP + STEP$ is still in the modulo
addressing area and $inc = ARP + STEP$. Otherwise $inc = TOP + delta$ for
$STEP > 0$ and $inc = BOTTOM - delta$ for $STEP < 0$. If modulo addressing is not
enabled, inc is always $ARP + STEP$.
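
The update rule in the caption of figure 14.7 can be tried out with the following Python sketch. It assumes that the lost operators in the caption are minus signs, which makes TOP the lowest and BOTTOM the highest address of the modulo area. The function name is hypothetical and wrap-around at 16 bits is not modelled.

```python
def post_increment(arp, step, top, bottom, modulo_enabled=True):
    """Return the next address register value (inc) for post increment addressing."""
    if not modulo_enabled:
        return arp + step
    if step >= 0:                        # sign bit STEP[15] is 0
        delta = arp + step - bottom - 1
        return arp + step if delta < 0 else top + delta
    delta = top - arp - step - 1         # sign bit STEP[15] is 1
    return arp + step if delta < 0 else bottom - delta
```

With TOP = 0 and BOTTOM = 7, stepping forward by one from address 7 gives address 0, and stepping backward by one from address 0 gives address 7.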
The M0 and M1 (Modulo Addressing) Control Bits
M0 and M1 enable modulo addressing for the address registers ARP0 and ARP1
respectively. See 14.7.2 below for more information on modulo addressing.
14.7 Addressing
The processor should be able to address up to 64 kWord of program memory (PM), 64
kWord of tap memory (TM), 4x64 kWord of data memory (DM0 - DM3, each belonging
to a different processor) and 64 kWord of third memory (3M). However, at this point
no 3M and only one DM have been implemented. When everything has been
implemented all references to TM in this thesis should be replaced by “TM or
3M” and almost all references to DM should be replaced by “DM0 - DM3”.
Register Direct
The data is taken from a register, GRP0-GRP31, pointed out by the instruction.
Example: add GRP5 GRP6
Immediate Address
The data is taken from a memory address from the instruction.
Example: loaddmi #hFF03 GRP4
Immediate Data
The data is taken from the instruction.
Example: addi #34 GRP5
Example: loaddm ARP0 #4 GRP3
Note: Register indirect addressing without post increment can be achieved either
by using offset addressing with zero offset or by using Register indirect addressing
with post increment with the step length set to zero.
O - Overflow has occurred
    The status flags are used to generate conditions for branch instructions. The
following branch conditions are available:
Condition               Flags
Greater than            N=0 and Z=0
Greater or equal        N=0
Equal                   Z=1
Less or equal           N=1 or Z=1
Less than               N=1
Not equal               Z=0
Carry                   C=1
Not carry               C=0
Overflow                O=1
Not overflow            O=0
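
The table above can be expressed directly as a lookup from condition to flag test. The following Python sketch is only an illustration: the condition keys ("gt", "ge" and so on) are hypothetical names and not the assembler mnemonics of the processor.

```python
# Each entry tests the status flags exactly as listed in the table above.
CONDITIONS = {
    "gt": lambda f: f["N"] == 0 and f["Z"] == 0,  # greater than
    "ge": lambda f: f["N"] == 0,                  # greater or equal
    "eq": lambda f: f["Z"] == 1,                  # equal
    "le": lambda f: f["N"] == 1 or f["Z"] == 1,   # less or equal
    "lt": lambda f: f["N"] == 1,                  # less than
    "ne": lambda f: f["Z"] == 0,                  # not equal
    "c":  lambda f: f["C"] == 1,                  # carry
    "nc": lambda f: f["C"] == 0,                  # not carry
    "o":  lambda f: f["O"] == 1,                  # overflow
    "no": lambda f: f["O"] == 0,                  # not overflow
}

def branch_taken(condition, flags):
    """Return True if the branch condition holds for the given N, Z, C, O flags."""
    return bool(CONDITIONS[condition](flags))
```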
Chapter 15
Control Path
[Figure: overview of the control path. Labels: PM, PC, instr, ID, nop, ctrl, ctrl2, pipe4, const, status, loopreg and data path.]
three step instructions operands are fetched, the operation is performed and the
result is written back to the register file. For four step instructions operands are
fetched, multiplication is executed and the product is written to the pipeline regis-
ter in the MAC unit.
     In the fourth step of a four step instruction, the value of an accumulator register
is fetched (if it is a mac instruction), the 40-bit addition/subtraction is executed and
the result is written back to the accumulator register.
     The variable pipeline depth causes problems in some cases. These are handled
by the pipeline controller which is described in 15.5.
     Appendix C has timing diagrams for the pipeline for some different program
flow cases.
The ctrl2 register stores signals that can be used either in step three or step four.
The pipe4 register is used during the third step to store control signals that will be
used in the fourth step.
    A special control signal in the ctrl register controls whether the signals in ctrl2
are for the third or fourth step.
15.4.2 Hardware Looping
The processor has two instructions for zero overhead hardware looping: The sim-
ple ’repeat’ instruction that repeats the following instruction a number of times
and the more complex ’loop’ instruction that repeats any number of instructions
larger than one.
Repeat
The ’repeat’ instruction is facilitated by the repeat register. During normal ex-
ecution this register holds the value one, every instruction is executed once and
the program counter is increased by one every clock cycle. However as soon as
the value of the repeat register is not one, the PC, the instruction register and the
control registers are no longer updated. Instead the repeat register is decreased by
one every clock cycle until its value is one again. The ’repeat’ instruction simply
loads a value (larger than one) into the repeat register thereby causing the next
instruction to be repeated the specified number of times.
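
The behaviour of the repeat register can be illustrated with a small instruction level model in Python. This is a sketch under simplifying assumptions: the pipeline is not modelled, and the program representation (a list of `("repeat", n)` and `("op", name)` tuples) is hypothetical.

```python
def run_with_repeat(program):
    """Execute a list of ("repeat", n) and ("op", name) entries.

    Returns the names of the executed operations in order.
    """
    trace = []
    repeat_reg = 1          # holds one during normal execution
    pc = 0
    while pc < len(program):
        kind, arg = program[pc]
        if kind == "repeat":
            repeat_reg = arg    # load a value larger than one
            pc += 1
            continue
        trace.append(arg)
        if repeat_reg != 1:
            repeat_reg -= 1     # PC is held, the same instruction runs again
        else:
            pc += 1             # normal execution
    return trace
```

In this model, `repeat 3` followed by an operation b makes b execute three times before the program continues.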
Loop
The ’loop’ instruction works in a completely different way than the ’repeat’ in-
struction.
    Before the ’loop’ instruction is executed the number of repetitions must be
loaded into the loop register (GRP7). The code section to be looped starts with
the instruction following the ’loop’ instruction and ends at an absolute address
specified in the instruction. When the ’loop’ instruction is executed, these two
program addresses and the value of the loop register are pushed to the loop stack.
    When the PC reaches the address equal to the loop end address on the top
of the loop stack, the PC is set to the corresponding loop start address and the
corresponding loop counter value is decreased by one. The loop counter value is
then copied back to the loop register in the register file. In that way the current
loop counter value is accessible from the program. If the loop counter value is one
when the PC reaches the loop end, the loop stack is popped.
    Since the loop stack depth is four, up to four nested loops are possible.
    Note: due to pipeline complications, some restrictions apply to how other
program flow instructions can be used with the ’loop’ and ’repeat’ instructions.
This information can be found in section 16.3.
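
The loop stack mechanism described above can be sketched as an instruction level model in Python. Several things here are assumptions made for the sake of illustration: the pipeline and the restrictions in section 16.3 are ignored, the instruction at the loop end address is taken to be the last instruction of the loop body, and the program representation with `("set_loop", n)`, `("loop", end_address)` and `("op", name)` entries is hypothetical.

```python
LOOP_STACK_DEPTH = 4

def run_with_loops(program):
    """Execute a program and return the names of the executed operations."""
    trace, stack = [], []
    loop_reg = 0                # models the loop register (GRP7)
    pc = 0
    while pc < len(program):
        kind, arg = program[pc]
        next_pc = pc + 1
        if kind == "set_loop":
            loop_reg = arg
        elif kind == "loop":
            if len(stack) == LOOP_STACK_DEPTH:
                raise RuntimeError("loop stack overflow")
            # Push start address, end address and loop counter value.
            stack.append([pc + 1, arg, loop_reg])
        else:
            trace.append(arg)
        if stack and pc == stack[-1][1]:    # PC reached the loop end on top of the stack
            start, _end, count = stack[-1]
            if count == 1:
                stack.pop()                 # last iteration done
            else:
                stack[-1][2] = count - 1
                loop_reg = count - 1        # copy the counter back to the loop register
                next_pc = start
        pc = next_pc
    return trace
```

For a loop counter of three and a body of two operations a and b, the model executes a and b three times and then continues with the instruction after the loop.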
15.5 Pipeline Controller
The pipeline controller monitors the pipeline. By looking at what type of in-
struction is currently executing (if it is a three or four step instruction and which
accumulator register it uses) and which is the next instruction to be executed, it
determines if the pipeline has to be halted for one clock cycle, before the next
instruction is executed.
    Halting the pipeline means in practice that the values of the PC and instruction
registers are kept and the control signals of a ’nop’ operation are loaded into the control
registers.
    Halting the pipeline is necessary if a four step instruction is followed by a
three step instruction and one of the following is true:
1. The source register of the three step instruction is in the accumulator register
used by the four step instruction.
2. The three step instruction is dependent on status flags generated by the four
step instruction.
3. The three step instruction uses the MAC unit (but not the multiplier, because
then it would not be a three step instruction).
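
The three conditions above can be summarized as a single decision function. The following Python sketch is not the actual controller logic: the `Instr` record and its field names are hypothetical, and condition 2 is simplified to a flag saying that the instruction depends on the status flags.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    steps: int                        # 3 or 4 pipeline steps
    acc: Optional[int] = None         # accumulator written by a four step instruction
    reads_acc: Optional[int] = None   # accumulator that holds a source register, if any
    uses_flags: bool = False          # depends on status flags of the previous instruction
    uses_mac_unit: bool = False       # uses the MAC unit (but not the multiplier)

def must_halt(current, nxt):
    """Return True if the pipeline has to be halted for one clock cycle."""
    if current.steps != 4 or nxt.steps != 3:
        return False
    reads_result = nxt.reads_acc is not None and nxt.reads_acc == current.acc
    return reads_result or nxt.uses_flags or nxt.uses_mac_unit
```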
Chapter 16
Instruction Set
Name        Bit     Description
Mux         31:27   Multiplexer switching between different instruction groups.
Op          26:22   Operation code choosing the actual operation.
Mem         21      1 for memory write operations, 0 for read.
            20      1 if TM or 3M is used.
            19      0 for TM, 1 for 3M.
            18      1 if DM is used.
            17:16   Selects DM memory bank 0-3.
Addr1       15:13   Address register for DM or 3M/TM addressing.
SReg        12:8    Source register.
Addr2       7:5     Address register for DM addressing when two parallel
                    memory operations are made.
DReg        4:0     Destination register.
A           4       Accumulator register, 0 for ACC0, 1 for ACC1
Address/
Constant    20:5    Immediate address or constant value.
Offset      7:0     Immediate offset address value.
Prog Addr   15:0    Immediate program address.
Note: As long as 3M and DM1-DM3 are not implemented, bits 16, 17 and 19 will
always be zero.
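
As an illustration of the table above, the subfields of a 32-bit instruction word can be extracted with shifts and masks. The sketch below is written in Python; the function name and the dictionary keys are hypothetical, and only the register format fields are decoded (not the overlapping Address/Constant, Offset and Prog Addr fields).

```python
def decode(word):
    """Split a 32-bit instruction word into its subfields."""
    return {
        "mux":          (word >> 27) & 0x1F,  # bits 31:27
        "op":           (word >> 22) & 0x1F,  # bits 26:22
        "write":        (word >> 21) & 0x1,   # bit 21, 1 for memory write
        "use_tm_or_3m": (word >> 20) & 0x1,   # bit 20
        "select_3m":    (word >> 19) & 0x1,   # bit 19, 0 for TM, 1 for 3M
        "use_dm":       (word >> 18) & 0x1,   # bit 18
        "dm_bank":      (word >> 16) & 0x3,   # bits 17:16
        "addr1":        (word >> 13) & 0x7,   # bits 15:13
        "sreg":         (word >> 8) & 0x1F,   # bits 12:8
        "addr2":        (word >> 5) & 0x7,    # bits 7:5
        "dreg":         word & 0x1F,          # bits 4:0
    }

# A word with Mux = 00100, Op = 00010, SReg = 5 and DReg = 6.
example = (0b00100 << 27) | (0b00010 << 22) | (5 << 8) | 6
fields = decode(example)
```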
   Example: move2tm ARPx++ GRPx , move2dm ARPy++ GRPy
16.3.1 Branch and Jump Instructions
This processor uses delayed jump, branch, subroutine call and subroutine return
instructions. In other words, the two instructions following the jump/branch etc. are
always executed whether the jump is taken or not. If one of these two instructions
were also a jump instruction, or for example a loop instruction, that would cause
complications. Therefore ’jmp’, ’bra’, ’call’ and ’rts’ instructions must always be
followed by two instructions that are not program flow instructions.
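
The effect of the two delay slots, as seen by the programmer, can be illustrated with a small Python model. This is a sketch only: it handles an unconditional jump to an absolute address, ignores the pipeline itself, and uses a hypothetical program representation of `("jmp", target)` and `("op", name)` entries.

```python
DELAY_SLOTS = 2

def run_delayed(program, max_steps=100):
    """Execute a program with delayed jumps and return the executed operation names."""
    trace = []
    pc = 0
    pending = None              # [target, remaining delay slots] of a jump in flight
    steps = 0
    while pc < len(program) and steps < max_steps:
        kind, arg = program[pc]
        steps += 1
        pc += 1
        if kind == "jmp":
            if pending is not None:
                raise ValueError("program flow instruction in a delay slot")
            pending = [arg, DELAY_SLOTS]
        else:
            trace.append(arg)
            if pending is not None:
                pending[1] -= 1
                if pending[1] == 0:     # both delay slots have been executed
                    pc = pending[0]
                    pending = None
    return trace
```

In this model a jump placed before operations b, c and d executes b and c, skips d and continues at the jump target.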
16.4 Instruction Encoding
This section describes the machine code of every instruction. The letters repre-
senting different subfields in the table have the following meanings:
A = Address register
C = Constant data, address or offset
c = Condition
D = Destination Register
M = Memory use
P = Program address register
r = Round accumulator
S = Source Register
s = Saturate accumulator
Y = Accumulator register
X = Occupied
- = Don’t care
00010 01011 0 0-0-- --- SSSS- --- DDDD-  move32 GRPx GRPy
00010 01100 0 CCCCC CCC CCCCC CCC DDDDD  load #data GRPy
ALU instructions, logic unit:
00100 00000 0 MMMMM AAA SSSSS --- DDDDD  and GRPx GRPy
00100 00001 0 CCCCC CCC CCCCC CCC DDDDD  andi #data GRPy
00100 00010 0 MMMMM AAA SSSSS --- DDDDD  or GRPx GRPy
00100 00011 0 CCCCC CCC CCCCC CCC DDDDD  ori #data GRPy
00100 00100 0 MMMMM AAA SSSSS --- DDDDD  xor GRPx GRPy
00100 00101 0 CCCCC CCC CCCCC CCC DDDDD  xori #data GRPy
00100 00110 0 MMMMM AAA SSSSS --- DDDDD  not GRPx GRPy
ALU instructions, shift unit:
00110 00000 0 MMMMM AAA SSSSS --- DDDDD  asl GRPx GRPy
00110 00001 0 ----- --- ---CC CCC DDDDD  asli #data GRPy
00110 00010 0 MMMMM AAA SSSSS --- DDDDD  lsl GRPx GRPy
00110 00011 0 ----- --- ---CC CCC DDDDD  lsli #data GRPy
00110 00100 0 MMMMM AAA SSSSS --- DDDDD  rsl GRPx GRPy
00110 00101 0 ----- --- ---CC CCC DDDDD  rsli #data GRPy
00110 00110 0 MMMMM AAA SSSSS --- DDDDD  rslc GRPx GRPy
00110 00111 0 ----- --- ---CC CCC DDDDD  rslci #data GRPy
MAC instructions:
01000 00000 0 MMMMM AAA SSSSS AAA YSSSS  mpy GRPx GRPy ACCx
01010 00000 0 MMMMM AAA SSSSS AAA YSSSS  mpy GRPx GRPy ACCx SAT
01001 00000 0 MMMMM AAA SSSSS AAA YSSSS  mpy GRPx GRPy ACCx RND
010sr 00001 0 MMMMM AAA SSSSS AAA YSSSS  mpyu GRPx GRPy ACCx [SAT/RND]
010sr 00010 0 MMMMM AAA SSSSS AAA YSSSS  mpysu GRPx GRPy ACCx [SAT/RND]
010sr 00011 0 MMMMM AAA SSSSS AAA YSSSS  mpyus GRPx GRPy ACCx [SAT/RND]
010s0 00100 0 MMMMM AAA SSSSS AAA YSSSS  mac GRPx GRPy ACCx [SAT]
010s0 00101 0 MMMMM AAA SSSSS AAA YSSSS  macu GRPx GRPy ACCx [SAT]
010s0 00110 0 MMMMM AAA SSSSS AAA YSSSS  macsu GRPx GRPy ACCx [SAT]
010s0 00111 0 MMMMM AAA SSSSS AAA YSSSS  macus GRPx GRPy ACCx [SAT]
010s0 01000 0 MMMMM AAA SSSSS AAA YSSSS  mac GRPx GRPy ACCx [SAT]
010s0 01001 0 MMMMM AAA SSSSS AAA YSSSS  macu GRPx GRPy ACCx [SAT]
010s0 01010 0 MMMMM AAA SSSSS AAA YSSSS  macsu GRPx GRPy ACCx [SAT]
010s0 01011 0 MMMMM AAA SSSSS AAA YSSSS  macus GRPx GRPy ACCx [SAT]
010s1 01100 0 MMMMM AAA DDDDD AAA YDDDD  rnd ACCx [SAT]
01010 01100 0 MMMMM AAA DDDDD AAA YDDDD  sat ACCx
01000 01101 0 MMMMM AAA DDDDD AAA YDDDD  clracc ACCx
010sr 01110 0 MMMMM AAA SSSSS AAA YDDDD  addacc GRPx ACCx [SAT]
010sr 01111 0 MMMMM AAA SSSSS AAA YDDDD  subacc GRPx ACCx [SAT]
010sr 10000 0 MMMMM AAA SSSSS AAA YDDDD  add32 GRPx ACCx [SAT]
010sr 10001 0 MMMMM AAA SSSSS AAA YDDDD  sub32 GRPx ACCx [SAT]
010sr 10010 0 MMMMM AAA SSSSS AAA Y----  lshl GRPx ACCx [SAT/RND]
010sr 10011 - ----- --- --CCC CCC Y----  lshli #data ACCx [SAT/RND]
Program flow instructions:
01100 00000 0 cccc- --- ----- --- PPPPP  bracond GRPx
01100 00001 0 cccc- CCC CCCCC CCC CCCCC  bracond #addr
01100 00010 0 ----- --- ----- --- PPPPP  jmp GRPx
01100 00011 0 ----- CCC CCCCC CCC CCCCC  jmp #addr
01100 00100 0 ----- --- ----- --- PPPPP  call GRPx
01100 00101 0 ----- CCC CCCCC CCC CCCCC  call #addr
01100 00111 0 ----- CCC CCCCC CCC CCCCC  loop #addr
01100 01001 0 ----- --- CCCCC CCC -----  repeat #data
01100 01010 0 ----- --- ----- --- -----  rts
Accelerator instructions:
1XXXX XXXXX X XXXXX XXX XXXXX XXX XXXXX  Accelerator instructions.
Bibliography
[2] Phil Lapsley, Jeff Bier, Amit Shoham, Edward A. Lee, DSP Processor Fun-
    damentals, IEEE Press, 1995.
Appendix A
This chapter has complete descriptions of all instructions of the processor. This
includes:
Type of instruction - Instruction group, short description.
Syntax - What the assembly code looks like, addressing modes.
Operands - What data the instruction can use.
Execution - What the instruction does (“mathematically”).
Description - Description of the behaviour of the instruction, and which registers
and status flags are affected.
Example - A short example of use of the instruction.
    abs
Type of instruction
Arithmetic instruction - absolute value
Syntax
Register direct:                               abs GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$|GRPx| \rightarrow GRPy$
Description
The absolute value of register GRPx is stored in register GRPy. The flags are not
updated.
Example
abs GRPx, GRPy
   add
Type of instruction
Arithmetic instruction - addition.
Syntax
Register direct without carry:              add GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$GRPx + GRPy \rightarrow GRPy$
Description
The values in register GRPx and GRPy are added and the result is stored in regis-
ter GRPy. The flags N, Z, C and O are updated. C is set when unsigned addition
generates carry. O is set when signed addition generates overflow.
Example
add GRPx, GRPy
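
The flag updates of the add instruction can be illustrated with a short Python sketch operating on 16-bit bit patterns. The function name is hypothetical, and it is assumed here that N reflects the sign bit of the result and Z indicates a zero result. C and O follow the description above.

```python
def add_with_flags(x, y):
    """Add two 16-bit bit patterns (0..0xFFFF) and return (result, flags)."""
    total = x + y
    result = total & 0xFFFF
    flags = {
        "N": result >> 15,                  # sign bit of the result (assumed)
        "Z": int(result == 0),              # result is zero (assumed)
        "C": int(total > 0xFFFF),           # carry out of the unsigned addition
        # Signed overflow: both operands differ in sign from the result.
        "O": int(((x ^ result) & (y ^ result) & 0x8000) != 0),
    }
    return result, flags
```

Adding h7FFF and h0001 gives h8000 with O set and C cleared, while adding hFFFF and h0001 gives zero with C set and O cleared.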
   addc
Type of instruction
Arithmetic instruction - addition with carry in.
Syntax
Register direct with carry:                   addc GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$GRPx + GRPy + C \rightarrow GRPy$
Description
The values in register GRPx and GRPy and the value of the flag C are added and
the result is stored in register GRPy. The flags N, Z, C and O are updated. C is set
when unsigned addition generates carry. O is set when signed addition generates
overflow.
Example
addc GRPx, GRPy
    addi
Type of instruction
Arithmetic instruction - addition with immediate data
Syntax
Immediate data without carry:                  addi #Data GRPy
Operands
$\mathrm{h8000} \le \mathrm{Data} \le \mathrm{h7FFF}$
Execution
$\mathrm{Data} + \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
The value in register GRPy and the Data value are added. The result is stored
in register GRPy. The flags N, Z, C and O are updated. C is set when unsigned
addition generates carry. O is set when signed addition generates overflow.
Example
addi #h1234 GRPy
   and
Type of instruction
Logic instruction - bitwise and.
Syntax
Register direct:                           and GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPx} \mathbin{\&} \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
Bitwise and between the values in register GRPx and GRPy. The result is stored
in register GRPy. The flags N and Z are updated.
Example
and GRPx GRPy
    andi
Type of instruction
Logic instruction - bitwise and with immediate data
Syntax
Immediate data:                                andi #Data GRPy
Operands
$\mathrm{h8000} \le \mathrm{Data} \le \mathrm{h7FFF}$
Execution
$\mathrm{Data} \mathbin{\&} \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
Bitwise and between the value in register GRPy and the Data. The result is stored
in register GRPy. The flags N and Z are updated.
Example
andi #hFF GRPy
   asl
Type of instruction
Shift instruction - arithmetic shift
Syntax
Register direct:                               asl GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \gg (\mathrm{GRPx} \mathbin{\&} \mathrm{h001F}) \rightarrow \mathrm{GRPy}$
Description
If the value in GRPx is positive the value in register GRPy is shifted GRPx steps
to the left. If the value in GRPx is negative the value in GRPy is arithmetically
shifted -GRPx steps to the right. The result is stored in register GRPy. The flags
N, Z, C and O are updated. O is set if overflow occurs on a left shift. C is the last
bit shifted out on a right shift.
Example
asl GRPx GRPy
   asli
Type of instruction
Shift instruction - arithmetic shift with immediate data.
Syntax
Immediate data:                               asli #Step, GRPy
Operands
$-15 \le \mathrm{Step} \le 15$
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \gg \mathrm{Step} \rightarrow \mathrm{GRPy}$
Description
If the value Step is positive the value in register GRPy is shifted Step steps to the
left. If Step is negative the value in GRPy is arithmetically shifted
-Step steps to the right. The result is stored in register GRPy. The flags N, Z,
C and O are updated. O is set if overflow occurs on a left shift. C is the last bit
shifted out on a right shift.
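A minimal Python sketch of this behaviour is given below. It is not taken from the design itself: it assumes 16-bit registers, it treats overflow on a left shift as "the shifted value no longer fits in a signed 16-bit word", and it leaves C at 0 on left shifts because the text only defines C for right shifts. The names to_signed16 and asli are hypothetical helpers.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def asli(step, value):
    """Return (result, C, O) for an arithmetic shift of value by step.

    Positive step shifts left, negative step shifts right arithmetically.
    """
    x = to_signed16(value)
    c = 0  # assumption: C is only defined for right shifts
    o = 0
    if step >= 0:
        wide = x << step
        # Assumed overflow rule: result not representable in 16 signed bits.
        o = int(not -0x8000 <= wide <= 0x7FFF)
        result = wide & 0xFFFF
    else:
        n = -step
        c = (x >> (n - 1)) & 1       # last bit shifted out
        result = (x >> n) & 0xFFFF   # >> on a negative int is arithmetic
    return result, c, o
```

With this model asli(-4, 0x8010) returns hF801 (the sign bit is replicated), and asli(1, 0x4000) returns h8000 with O set.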
Example
asli #-4 GRPy
   avg
Type of instruction
Arithmetic instruction - average value.
Syntax
Register direct:                               avg GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\frac{\mathrm{GRPx} + \mathrm{GRPy}}{2} \rightarrow \mathrm{GRPy}$
Description
The average value of the value in register GRPx and in register GRPy is stored in
register GRPy. The flags N and Z are updated.
Example
avg GRPx GRPy
     bra{cond}
Type of instruction
Program flow instruction - conditional jump.
Syntax
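Register direct:                             bra{cond} GRPx
Immediate PC address:                        bra{cond} #Addr
Operands
$\mathrm{h0000} \le \mathrm{Addr} \le \mathrm{hFFFF}$
Execution
if {cond} is TRUE
    $\mathrm{GRPx} \rightarrow \mathrm{PC}$
    or
    $\mathrm{Addr} \rightarrow \mathrm{PC}$
else
    $\mathrm{PC} + 1 \rightarrow \mathrm{PC}$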
Description
A conditional branch, either register or constant based.
The jump is delayed two cycles, that is, the two instructions following the branch
instruction are executed whether the branch is taken or not. Neither of the two
following instructions may be a bra, call, rts, loop or repeat instruction. Bra may not
be used as a repeat instruction or as one of the two last instructions in a hardware
loop. No flags are updated.
Condition    Meaning              Flags
eq           GRPx = GRPy          Z=1
ne           GRPx ≠ GRPy          Z=0
c            carry                C=1
nc           not carry            C=0
o            overflow             O=1
no           not overflow         O=0
Example
bragt #h30
braeq GRPx
    call
Type of instruction
Program flow instruction - subroutine jump
Syntax
Immediate PC address:                        call #Addr
Register direct:                             call GRPx
Operands
$\mathrm{h0000} \le \mathrm{Addr} \le \mathrm{h7FFF}$
Execution
$\mathrm{PC} \rightarrow \mathrm{PC\ stack}$
$\mathrm{Addr} \rightarrow \mathrm{PC}$
Description
A subroutine jump.
    The jump is delayed two cycles, that is the two instructions following the call
instruction are executed before the jump is taken. None of the two following in-
structions may be bra, call, rts, loop or repeat instructions. Call may not be used
as a repeat instruction or as one of the two last instructions in a hardware loop.
No flags are updated.
Example
call #h3000
   clracc
Type of instruction
MAC instruction - clear accumulator
Syntax
Register direct:                                 clracc ACCx
Operands
ACCx: ACC0, ACC1
Execution
$0 \rightarrow \mathrm{ACCx}$
Description
Clear accumulator register.
No flags are updated.
Example
clracc ACC0
   comp
Type of instruction
Arithmetic instruction - compare two values.
Syntax
Register direct:                            comp GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} - \mathrm{GRPx} \rightarrow \mathrm{None}$
Description
The value in register GRPx is subtracted from the value in register GRPy, but the
result is not stored. The flags are updated. C is set when unsigned subtraction
does not generate borrow. O is set when signed subtraction generates overflow.
Example
comp GRPx GRPy
    jmp
Type of instruction
Program flow instruction - jump
Syntax
Register direct:                             jmp GRPx
Immediate PC address:                        jmp #Addr
Operands
$\mathrm{h0000} \le \mathrm{Addr} \le \mathrm{hFFFF}$
Execution
$\mathrm{Addr} \rightarrow \mathrm{PC}$
Description
Unconditional jump.
The jump is delayed two cycles, that is the two instructions following the jmp
instruction are executed before the jump is taken. None of the two following in-
structions may be bra, call, rts, loop or repeat instructions. Jmp may not be used
as a repeat instruction or as one of the two last instructions in a hardware loop.
No flags are updated.
    Example
jmp #h3000
    load
Type of instruction
Data move instruction. Load register with immediate data
Syntax
Immediate data:       load #Const GRPy
Operands
$\mathrm{h8000} \le \mathrm{Const} \le \mathrm{h7FFF}$
Execution
$\mathrm{Const} \rightarrow \mathrm{GRPy}$
Description
The constant, Const, is loaded into the register GRPy. No flags are updated.
Example
load #h2034 GRPy
  loadtm, load3m, loaddm, loaddm0, loaddm1,
loaddm2, loaddm3
Type of instruction
Memory instruction - Load register from memory
Syntax
Register indirect with postincrement:         loadXmX ARPx++ GRPy
Register indirect with offset address:        loadXmX ARPx #Offset GRPy
Operands
$\mathrm{h0} \le \mathrm{Offset} \le \mathrm{hFF}$
Execution
$\mathrm{DM0}(\#\mathrm{Addr}) \rightarrow \mathrm{GRPy}$
Description
The value stored at the address ARPx or ARPx + Offset in the specified memory
is copied to register GRPy. loaddm is equivalent to loaddm0. The memory bits
decide which memory is used.
If addressing with postincrement is used, ARPx is increased by the value in the
STEPx register.
The flags N and Z are updated.
Example
loaddm ARPx++, GRPy
Register/Memory   Before          After
ARPx Register        h0200           h0200
GRPy                 hFF12           h2222
DM(h0200)            h0000           h0000
DM(h0201)            h1111           h1111
DM(h0202)            h2222           h2222
DM(h0203)            h3333           h3333
DM(h0204)            h4444           h4444
    loaddmi
Type of instruction
Memory instruction - Load register, immediate address
Syntax
Immediate address:                            loaddmi #Addr GRPx
Operands
$\mathrm{h0} \le \mathrm{Offset} \le \mathrm{hFF}$
Execution
$\mathrm{DM0}(\#\mathrm{Addr}) \rightarrow \mathrm{GRPx}$
Description
The value stored at the address #Addr in dm0 is copied to the register GRPx. The
flags are not updated. Note that there is no equivalent function for any other mem-
ories than dm0.
Example
loaddmi #hFF00 GRPy
    loop
Type of instruction
Program flow instruction - hardware loop
Operands
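$\mathrm{h0000} \le \mathrm{Addr} \le \mathrm{hFFFF}$
LOOP register
Execution
$\mathrm{PC} + 1 \rightarrow \mathrm{Loop\ start\ stack}$
$\mathrm{Addr} \rightarrow \mathrm{Loop\ end\ stack}$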
Description
The instructions between the loop instruction and the PC address Addr (including
that address) are repeated the number of times specified by the value in the LOOP
register.
Up to four nested loops are possible; however, two loops may never end at the
same address. The loop must contain at least two instructions (otherwise repeat is
used) and the last two instructions in a loop must not be jmp, bra, call or repeat.
Example
load #30 LOOP
loop #2000
   l shl
Type of instruction
MAC instruction - 32-bit shift
Syntax
Register direct:                             l shl GRPx ACCy
Operands
GRPx: GRP0 - GRP31
ACCy: ACC0, ACC1
Execution
$\mathrm{ACCy} \ll (\mathrm{GRPx} \mathbin{\&} \mathrm{h3F}) \rightarrow \mathrm{ACCy}$
Description
If the value in GRPx is positive the value in accumulator ACCy is shifted GRPx
steps to the left. If the value in GRPx is negative the value in ACCy is arithmetically
shifted -GRPx steps to the right. The result is stored in register ACCy. The
flags N, Z, C and O are updated. O is set if overflow occurs on a left shift. C is
the last bit shifted out on a right shift.
Example
l shl GRPx ACCy
   l shli
Type of instruction
MAC instruction. 32-bit shift with immediate data
Syntax
Immediate data:                                  l shli #Steps ACCy
Operands
$-32 \le \mathrm{Steps} \le 31$
ACCy: ACC0, ACC1
Execution
$\mathrm{ACCy} \ll \mathrm{Steps} \rightarrow \mathrm{ACCy}$
Description
If the value Steps is positive the value in accumulator ACCy is shifted Steps steps
to the left. If Steps is negative the value in ACCy is arithmetically shifted -Steps
steps to the right. The result is stored in register ACCy. The flags N, Z, C and O
are updated. O is set if overflow occurs on a left shift. C is the last bit shifted out
on a right shift.
Example
l shli #12 ACCy
    lsl
Type of instruction
Shift instruction - logical shift.
Syntax
Register direct:                              lsl GRPx, GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \ll (\mathrm{GRPx} \mathbin{\&} \mathrm{h1F}) \rightarrow \mathrm{GRPy}$
Description
If the value in GRPx is positive the value in register GRPy is shifted GRPx steps
to the left. If the value in GRPx is negative the value in GRPy is logically shifted
-GRPx steps to the right. The result is stored in register GRPy. The flags N, Z,
C and O are updated. O is set if overflow occurs on a left shift. C is the last bit
shifted out on a right shift.
Example
lsl GRPx GRPy
   lsli
Type of instruction
Shift instruction - logical shift with immediate data.
Syntax
Immediate data:                                lsli #Step, GRPy
Operands
$-15 \le \mathrm{Step} \le 15$
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \gg \mathrm{Step} \rightarrow \mathrm{GRPy}$
Description
If the value Step is positive the value in register GRPy is shifted Step steps to the
left. If Step is negative the value in GRPy is logically shifted -Step
steps to the right. The result is stored in register GRPy. The flags N, Z, C and O
are updated. O is set if overflow occurs on a left shift. C is the last bit shifted out
on a right shift.
Example
lsli #4 GRPy
  mac, macu, macus, macsu, macsub, macsubu, macsubus, macsubsu
Type of instruction
MAC instruction - multiply and accumulate
Syntax
Register direct:          mac[sub][u/su/us] GRPx GRPy ACCz [SAT]
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
ACCz: ACC0, ACC1
Execution
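$\mathrm{ACCz} + \mathrm{GRPx} \times \mathrm{GRPy} \rightarrow \mathrm{ACCz}$
$\mathrm{ACCz} - \mathrm{GRPx} \times \mathrm{GRPy} \rightarrow \mathrm{ACCz}$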
Description
The value of register GRPx is multiplied by the value of register GRPy and the
product is added to (mac, macu, macus, macsu) or subtracted from (macsub,
macsubu, macsubus, macsubsu) the accumulator ACCz. mac and macsub execute
a signed multiplication and macu and macsubu an unsigned multiplication.
macus and macsubus treat the first operand as unsigned; macsu and macsubsu
treat the second operand as unsigned. If SAT is added the accumulator will be
saturated after the accumulation. The status flags N, Z and O are updated.
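The following Python sketch models the multiply and accumulate variants. It rests on several assumptions that are not stated in this entry: 16-bit operand registers, a 48-bit accumulator (inferred from the 12-digit hexadecimal constants in the sat entry) and saturation to the signed 32-bit range as described for sat. The function and parameter names are hypothetical.

```python
ACC_BITS = 48  # assumption: inferred from the constants in the sat entry


def as_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, x, y, x_signed=True, y_signed=True, subtract=False, sat=False):
    """Return the new accumulator value after acc +/- x * y.

    mac:   x_signed=True,  y_signed=True
    macu:  x_signed=False, y_signed=False
    macus: x_signed=False, y_signed=True   (first operand unsigned)
    macsu: x_signed=True,  y_signed=False  (second operand unsigned)
    """
    a = as_signed(x, 16) if x_signed else x & 0xFFFF
    b = as_signed(y, 16) if y_signed else y & 0xFFFF
    product = a * b
    total = as_signed(acc, ACC_BITS) + (-product if subtract else product)
    if sat:
        # Assumed saturation range: signed 32 bits, as for the sat instruction.
        total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & ((1 << ACC_BITS) - 1)
```

For instance, hFFFF times 2 accumulates -2 with mac but 131070 with macu, which is the difference between the signed and unsigned variants.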
Example
mac GRPx GRPy ACC0
   move
Type of instruction
Data move instruction - move between registers
Syntax
Register direct:     move GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPx} \rightarrow \mathrm{GRPy}$
Description
The value of register GRPx is copied to register GRPy. No flags are updated.
Example
move GRPx GRPy
    move2tm, move23m, move2dm, move2dm#
Type of instruction
Memory instruction - write to memory
Syntax
Register indirect with postincrement:                move2{tm/3m/dm[x]} ARPx++ GRPx
Register indirect with offset address:               move2{tm/3m/dm[x]} ARPx #Offset GRPx
Operands
$\mathrm{h0} \le \mathrm{Offset} \le \mathrm{hFF}$
Execution
$\mathrm{GRPx} \rightarrow \mathrm{DM}(\mathrm{ARPx})$
$\mathrm{GRPx} \rightarrow \mathrm{DM}(\mathrm{ARPx} + \mathrm{Offset})$
Description
The value of register GRPx is copied to the specified memory address. No flags are
updated.
Example
move2dm ARPx++ GRPx
DM(h0200)   h1234         h1234
DM(h0201)   h1234         h1234
DM(h0202)   h1234         hFF12
DM(h0203)   h1234         h1234
DM(h0204)   h1234         h1234
    movedmi
Type of instruction
Memory instruction - write to memory, immediate address
Syntax
Address direct:                              movedmi #Addr GRPx
Operands
$\mathrm{h0} \le \mathrm{Offset} \le \mathrm{hFF}$
Execution
$\mathrm{GRPx} \rightarrow \mathrm{DM0}(\mathrm{Addr})$
Description
The value stored in register GRPx is copied to the address Addr in dm0. The flags
are not updated. Note that there is no equivalent function for any other memories
than dm0.
    Example
movedmi #hFF00 GRPy
   mpy, mpyu, mpyus, mpysu
Type of instruction
MAC instruction - multiplication
Syntax
Register direct:        mpy[u/su/us] GRPx GRPy ACCz [SAT/RND]
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
ACCz: ACC0, ACC1
Execution
$\mathrm{GRPx} \times \mathrm{GRPy} \rightarrow \mathrm{ACCz}$
Description
The value of register GRPx is multiplied by the value of register GRPy and the
product is placed in the accumulator ACCz. mpy executes a signed multiplication
and mpyu an unsigned multiplication. mpyus treats the first operand as unsigned
and mpysu treats the second operand as unsigned. If SAT is added the result will be
saturated and if RND is added the result will be rounded. The status flags N and
Z are updated.
Example
mpy GRPx GRPy ACC0
   neg
Type of instruction
Arithmetic instruction - negate value.
Syntax
Register direct:                           neg GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$-\mathrm{GRPx} \rightarrow \mathrm{GRPy}$
Description
The value in register GRPx is negated and stored in register GRPy. The flags are
not updated.
Example
neg GRPx GRPy
   nop
Type of instruction
Program flow instruction - no operation
Syntax
No operands:                               nop
Operands
Execution
$\mathrm{PC} + 1 \rightarrow \mathrm{PC}$
Description
This instruction only affects the PC and is used to create execution delays.
Example
nop
   not
Type of instruction
Logic instruction - invert register.
Syntax
Register direct:                         not GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{inv}(\mathrm{GRPx}) \rightarrow \mathrm{GRPy}$
Description
The value in register GRPx is inverted bitwise and stored in register GRPy. The
flags are not updated.
Example
not GRPx GRPy
   or
Type of instruction
Logic instruction - bitwise or.
Syntax
Register direct:                         or GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPx} \mid \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
Bitwise or between the values in register GRPx and GRPy. The result is stored in
register GRPy. The flags N and Z are updated.
Example
or GRPx GRPy
    ori
Type of instruction
Logic instruction - bitwise or with immediate data.
Syntax
Immediate data:                             ori #Data GRPy
Operands
$\mathrm{h8000} \le \mathrm{Data} \le \mathrm{h7FFF}$
Execution
$\mathrm{Data} \mid \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
Bitwise or between the value in register GRPy and the value Data. The result is
stored in register GRPy. The flags N and Z are updated.
Example
ori #h1111 GRPy
   repeat
Type of instruction
Program flow instruction - repeat instruction
Syntax
Immediate data:                           repeat #Data
Operands
$0 \le \mathrm{Data} \le 255$
Execution
$\mathrm{Data} \rightarrow \mathrm{RepeatReg}$
Description
The instruction following the repeat instruction is repeated Data number of times
before the PC is incremented again. The flags are not updated.
Example
repeat #20
   round
Type of instruction
MAC instruction - round.
Syntax
Register direct:                                 round ACCx [SAT]
Operands
ACCx: ACC0, ACC1
Execution
if $\mathrm{ACCx}_{\mathrm{low}} > \mathrm{h8000}$ then $\mathrm{ACCx} + \mathrm{h8000} \rightarrow \mathrm{ACCx}$
Description
Rounds the accumulator register.
If bit 15 of ACCx is a ’1’, h8000 is added to ACCx.
If SAT is added the result will be saturated after rounding.
The flags N, Z and O are updated.
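A short Python sketch of the rounding step follows. It uses the wording of the Description (bit 15 of ACCx is tested) and assumes a 48-bit accumulator and saturation to the signed 32-bit range; both widths are inferred from the sat entry rather than stated here. The names as_signed and round_acc are hypothetical.

```python
ACC_BITS = 48  # assumption: inferred from the constants in the sat entry


def as_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def round_acc(acc, sat=False):
    """Add h8000 to the accumulator when bit 15 is set."""
    value = as_signed(acc, ACC_BITS)
    if value & 0x8000:          # bit 15 set (also valid for negative ints)
        value += 0x8000
    if sat:
        # Assumed saturation range: signed 32 bits.
        value = max(-(1 << 31), min((1 << 31) - 1, value))
    return value & ((1 << ACC_BITS) - 1)
```

In this model h00018000 rounds up to h00020000, while h00017FFF is left unchanged. Note that only the carry into bit 16 is produced; the low word is not cleared.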
   Example
round ACC0
    rsl
Type of instruction
Shift instruction - rotational shift.
Syntax
Register direct:                            rsl GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \ll (\mathrm{GRPx} \mathbin{\&} \mathrm{h1F}) \rightarrow \mathrm{GRPy}$
Description
The value in register GRPy is rotated GRPx steps to the left without carry between
msb and lsb. The result is stored in register GRPy. The flags N and Z are updated.
Negative value in GRPx results in right rotation.
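As an illustration, a rotation without carry can be modelled in Python as below. The sketch assumes 16-bit registers, and the function name rsl is only borrowed from the mnemonic; it is not the implementation used in the processor.

```python
def rsl(steps, value):
    """Rotate the 16-bit value left by steps; negative steps rotate right."""
    n = steps % 16               # a right rotation by k equals a left one by 16 - k
    v = value & 0xFFFF
    return ((v << n) | (v >> (16 - n))) & 0xFFFF
```

With this model rsl(4, 0x1234) gives h2341 and rsl(-4, 0x1234) gives h4123.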
Example
rsl GRPx GRPy
   rsli
Type of instruction
Shift instruction - Rotational shift with immediate data.
Syntax
Immediate data:                            rsli #Step GRPy
Operands
$-15 \le \mathrm{Step} \le 15$
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \ll \mathrm{Step} \rightarrow \mathrm{GRPy}$
Description
The value in register GRPy is rotated Step steps to the left without carry between
msb and lsb. Negative Step gives rotation to the right. The result is stored in
register GRPy. The flags N and Z are updated.
Example
rsli #4, GRPy
   rslc
Type of instruction
Shift instruction - Rotation with intermediate carry.
Syntax
Register direct:                           rslc GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \ll (\mathrm{GRPx} \mathbin{\&} \mathrm{hF}) \rightarrow \mathrm{GRPy}$
Description
The value in register GRPy is rotated GRPx steps to the left with carry storage
between msb and lsb. Negative value in GRPx gives rotation to the right. The
result is stored in register GRPy. The flags N, Z and C are updated.
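One way to picture rotation with intermediate carry is as a 17-bit rotation in which the C flag sits between msb and lsb. The Python sketch below takes that view; it assumes 16-bit registers, and the exact hardware behaviour for multi-step rotations is an assumption. The name rslc is borrowed from the mnemonic for readability.

```python
def rslc(steps, value, carry):
    """Rotate value left through the carry flag; return (result, C).

    The carry is placed above bit 15, giving a 17-bit quantity that is
    rotated as a whole. Negative steps rotate to the right.
    """
    combined = ((carry & 1) << 16) | (value & 0xFFFF)
    n = steps % 17
    rotated = ((combined << n) | (combined >> (17 - n))) & 0x1FFFF
    return rotated & 0xFFFF, rotated >> 16
```

For example, rotating h8000 one step to the left with C=0 moves the msb into C and the old C into the lsb, giving h0000 with C=1.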
Example
rslc GRPx, GRPy
   rslci
Type of instruction
Shift instruction - Rotation with intermediate carry and immediate data.
Syntax
Immediate data:                           rslci #Step GRPy
Operands
$-15 \le \mathrm{Step} \le 15$
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} \ll \mathrm{Step} \rightarrow \mathrm{GRPy}$
Description
The value in register GRPy is rotated Step steps to the left with carry storage
between msb and lsb. The result is stored in register GRPy. The flags N, Z and C
are updated.
Example
rslci #4 GRPy
      rts
Type of instruction
Program flow instruction - return from subroutine.
Syntax
No operands:                                rts
Operands
Execution
$\mathrm{PC\ stack} \rightarrow \mathrm{PC}$
Description
This instruction jumps back from the subroutine and restores the PC value.
    The jump is delayed two cycles, that is the two instructions following the rts
instruction are executed before the jump is taken. None of the two following in-
structions may be bra, call, rts, loop or repeat instructions. rts may not be used as
a repeat instruction or as one of the two last instructions in a hardware loop.
No flags are updated.
Example
rts
    sat
Type of instruction
MAC instruction - saturate.
Syntax
Register direct:                                 sat ACCx
Operands
ACCx: ACC0, ACC1
Execution
$\mathrm{sat}(\mathrm{ACCx}) \rightarrow \mathrm{ACCx}$
Description
Saturate accumulator register.
If the value of ACCx cannot be represented with 32 bits, ACCx will be set to
h00007FFFFFFF or hFFFF80000000 depending on the sign of ACCx. Otherwise
the value will be kept.
Flag O is set if ACCx was larger than 32 bits.
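The saturation rule can be sketched in Python as follows. The 48-bit accumulator width is an assumption read off the 12-digit constants above, and the helper names are hypothetical.

```python
ACC_BITS = 48  # assumption: h00007FFFFFFF has 12 hex digits


def as_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sat(acc):
    """Return (saturated accumulator, O flag)."""
    value = as_signed(acc, ACC_BITS)
    lowest, highest = -(1 << 31), (1 << 31) - 1
    overflow = not lowest <= value <= highest
    clamped = max(lowest, min(highest, value))
    return clamped & ((1 << ACC_BITS) - 1), int(overflow)
```

A value just above the positive limit, h000080000000, is clamped to h00007FFFFFFF with O set, whereas h000012345678 is returned unchanged with O cleared.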
    Example
sat ACC0
   sub
Type of instruction
Arithmetic instruction - subtraction.
Syntax
Register direct without carry:            sub GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} - \mathrm{GRPx} \rightarrow \mathrm{GRPy}$
Description
The value in register GRPx is subtracted from the value in register GRPy and the
result is stored in register GRPy. The flags N, Z, C and O are updated. C is set
when unsigned subtraction does not generate borrow. O is set when signed sub-
traction generates overflow.
Example
sub GRPx GRPy
   subc
Type of instruction
Arithmetic instruction - subtraction with carry.
Syntax
Register direct with carry:                subc GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPy} - \mathrm{GRPx} - 1 + \mathrm{C} \rightarrow \mathrm{GRPy}$
Description
The value in register GRPx is subtracted from the value in register GRPy. If C is
not set (for example if the previous instruction was a subtraction that generated
borrow) one more is subtracted. The result is stored in register GRPy. The flags
N, Z, C and O are updated. C is set if borrow does not occur.
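The typical use of this carry convention is multi-word subtraction: a sub on the low words followed by a subc on the high words. The Python sketch below models this for a 32-bit value held in two 16-bit registers. The register width and the helper names sub16 and sub32 are assumptions made for the illustration.

```python
def sub16(x, y, carry=1):
    """Model GRPy - GRPx - 1 + C; return (result, C).

    With carry=1 this behaves like sub; passing the previous C flag
    gives subc. C is set when no borrow occurs.
    """
    total = (y & 0xFFFF) - (x & 0xFFFF) - 1 + carry
    return total & 0xFFFF, int(total >= 0)


def sub32(a, b):
    """Compute a - b on 32-bit values using a sub / subc pair."""
    low, c = sub16(b & 0xFFFF, a & 0xFFFF)           # sub on the low words
    high, c = sub16(b >> 16, a >> 16, carry=c)       # subc on the high words
    return (high << 16) | low, c
```

For example, h00010000 minus h00000001 borrows from the high word and yields h0000FFFF with C set after the final subc.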
Example
subc GRPx GRPy
    subi
Type of instruction
Arithmetic instruction - subtraction with immediate data
Syntax
Immediate data without carry:              subi #Data, GRPy
Operands
$\mathrm{h0} \le \mathrm{Data} \le \mathrm{hFFFF}$
Execution
$\mathrm{GRPy} - \mathrm{Data} \rightarrow \mathrm{GRPy}$
Description
The value Data is subtracted from the value in register GRPy and the result is
stored in register GRPy. The flags N, Z, C and O are updated. C is set when
unsigned subtraction does not generate borrow. O is set when signed subtraction
generates overflow.
Example
subi #h5 GRPy
   xor
Type of instruction
Logic instruction - bitwise xor.
Syntax
Register direct:                        xor GRPx GRPy
Operands
GRPx: GRP0 - GRP31
GRPy: GRP0 - GRP31
Execution
$\mathrm{GRPx} \;\mathrm{xor}\; \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
Bitwise xor between the values in register GRPx and GRPy. The result is stored
in register GRPy. The flags N and Z are updated.
Example
xor GRPx GRPy
    xori
Type of instruction
Logic instruction - bitwise xor with immediate data.
Syntax
Immediate data:                           xori #Data GRPy
Operands
$\mathrm{h8000} \le \mathrm{Data} \le \mathrm{h7FFF}$
Execution
$\mathrm{Data} \;\mathrm{xor}\; \mathrm{GRPy} \rightarrow \mathrm{GRPy}$
Description
Bitwise xor between the value in register GRPy and the value Data. The result is
stored in register GRPy. The flags N and Z are updated.
Example
xori #h1111 GRPy
Appendix B
This is the assembly code for the FIR-filter program that was used for verification.
The lack of I/O instructions makes it a bit awkward.
* FIR-filter
* input is dm(0:m)
* output is dm(2000:2000+m)
* tap coefficients tm(0:n)
* ARP0 Tap
* ARP1 Input
* ARP2 First sample
* ARP3 Output
load #1 STEP1
load #0 ARP2                          * input start address
load #1 STEP2
load #2000 ARP3                       * output start address
load #1 STEP3
Appendix C
In order to find potential pipeline conflicts many special program flow cases were
studied in detail and pipeline timing diagrams were made. Here, a few simple
cases are shown to illustrate how delayed jumps and hardware loops work.
      0: braeq #10
      1: instr1
      2: instr2
      3: instr3
      10: instr10
PC:       0        1        2        3/10
fetch:    braeq    instr1   instr2   instr3/10
decode:            braeq    instr1   instr2   instr3/10
execute:                    braeq    instr1   instr2   instr3/10
Figure C.1: Delayed branch. The two instructions following a jump or branch are
always executed, whether the jump is taken or not.
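The timing in Figure C.1 can be reproduced with a very small Python model. It is a simplified sketch that assumes the three pipeline stages shown in the figure (fetch, decode, execute) and a branch that redirects the PC when it reaches the execute stage, two cycles after it was fetched. The function name and parameters are hypothetical.

```python
def branch_pc_trace(branch_addr, target, taken, cycles):
    """Return the PC value in each cycle, starting from address 0."""
    trace = []
    pc = 0
    redirect_cycle = None
    for cycle in range(cycles):
        trace.append(pc)
        if pc == branch_addr:
            # The branch is in the execute stage two cycles after its fetch.
            redirect_cycle = cycle + 2
        if taken and cycle == redirect_cycle:
            pc = target      # the two delay slot instructions are already fetched
        else:
            pc += 1
    return trace
```

Calling branch_pc_trace(0, 10, True, 5) gives the PC sequence 0, 1, 2, 10, 11, and with taken=False the sequence is 0, 1, 2, 3, 4, which corresponds to the "3/10" column in the figure.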
                       0: repeat #3
                       1: instr1
                       2: instr2
                       3: instr3
                       4: instr4
PC: 0 1 2 3 3 3 4
Figure C.2: The repeat instruction. When the repeat instruction is executed, its
argument is copied to the repeat register. As long as the value in the repeat register
is greater than one, the PC, the instruction register and the control registers are not
updated.
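The PC sequence in Figure C.2 can be modelled with the following Python sketch. It assumes a three-stage pipeline in which the PC runs two addresses ahead of the instruction being executed, so the PC is frozen at repeat address + 3 while the repeated instruction sits in the execute stage. This is a simplification made for illustration; the function name is hypothetical.

```python
def repeat_pc_trace(repeat_addr, count, cycles):
    """Return the PC value in each cycle for `repeat #count` at repeat_addr."""
    trace = []
    pc = 0
    hold_pc = repeat_addr + 3    # PC value while the repeated instruction executes
    remaining = count            # models the repeat register
    for _ in range(cycles):
        trace.append(pc)
        if pc == hold_pc and remaining > 1:
            remaining -= 1       # PC is not updated while the register is > 1
        else:
            pc += 1
    return trace
```

For repeat #3 at address 0 the model returns 0, 1, 2, 3, 3, 3, 4 over seven cycles, the same sequence as in the figure.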
                          0: loop #4
                          1: instr1
                          2: instr2
                          3: instr3
                          4: instr4
PC:       0       1       2       3       4       1       2       3       4       1
fetch:    loop    instr1  instr2  instr3  instr4  instr1  instr2  instr3  instr4  instr1
decode:           loop    instr1  instr2  instr3  instr4  instr1  instr2  instr3  instr4
execute:                  loop    instr1  instr2  instr3  instr4  instr1  instr2  instr3
LOOP:     5       5       5       5       5       5       5       4       4       4
top of loop stack
start:                            1       1       1       1       1       1       1
end:                              4       4       4       4       4       4       4
counter:                          5       5       4       4       4       4       3
Figure C.3: The loop instruction. Before executing the loop instruction the num-
ber of loops has to be loaded to the LOOP register. When the loop instruction is
executed, loop start, loop end and number of loops are pushed to the loop stack.
When PC is equal to the loop end value, the loop start value is copied to the
PC. There is a two cycle delay before the loop counter value is copied back to the
LOOP register. In that way LOOP is updated the same cycle as the first instruction
in the loop is executed.
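The PC behaviour of the hardware loop can be sketched in Python as below. The model only tracks the PC and the loop counter on top of the loop stack; it ignores the cycles before the loop instruction reaches the execute stage and the delayed copy back to the LOOP register. It also assumes that the loop is left when the end address is reached with the counter equal to one. The function name is hypothetical.

```python
def loop_pc_trace(loop_addr, end_addr, count, cycles):
    """Return the PC value in each cycle for a loop instruction at loop_addr."""
    start = loop_addr + 1        # PC + 1 is pushed as loop start
    counter = count              # number of loops taken from the LOOP register
    pc = loop_addr
    trace = []
    for _ in range(cycles):
        trace.append(pc)
        if pc == end_addr and counter > 1:
            pc = start           # loop start value is copied to the PC
            counter -= 1
        else:
            pc += 1              # normal increment, or fall out of the loop
    return trace
```

With loop_addr=0, end_addr=4 and count=5 the first ten PC values are 0, 1, 2, 3, 4, 1, 2, 3, 4, 1, as in Figure C.3, and the loop body is entered five times in total.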
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring excep-
tional circumstances.
    The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose. Sub-
sequent transfers of copyright cannot revoke this permission. All other uses of
the document are conditional on the consent of the copyright owner. The pub-
lisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
    According to intellectual property law the author has the right to be men-
tioned when his/her work is accessed as described above and to be protected
against infringement.
    For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity, please
refer to its WWW home page: http://www.ep.liu.se/
© Eric Tell
Date: 2000-12-17
Abstract
This thesis describes the design of a domain specific DSP processor.
The thesis is divided into two parts. The first part gives some theoretical background, describes the
different steps of the design process (both for DSP processors in general and for this project) and
motivates the design decisions made for this processor.
The intended use of the processor is as a platform for hardware acceleration units. Support for this
has however not yet been implemented.
Keywords
DSP processor design, CPU design