5-stage Pipeline CPU hardware
Pipeline Hazards:
   [1] Data hazards
   [2] Control hazards
       A control hazard occurs whenever the normal sequential flow of the
       program changes (branch/jump, subroutine call, interrupt, return from
       interrupt, etc.)
   [3] Structural hazards
       (a) A multiply instruction holds the Ex stage for two or more clock cycles.
       (b) Two or more instructions in the pipeline try to read/write the register
           file at once. Since there is only one read/write port, only one
           instruction is allowed to access the register file at a time.
ARM Architecture
   ARM core :
      Pipelined RISC CPU with a reduced number of fixed-size instructions
      Offers high code density, small size, low power
      Applications include cell phones, handheld PDAs, cameras
   A few deviations from pure RISC (to gain some advantages):
        Variable-cycle execution for certain instructions, to support multiple
         load and store
        Inline barrel shifter, leading to a few complex instructions;
         preprocessing one operand enhances computational power
        Thumb state (16-bit instruction set) to improve code density
        Conditional execution of instructions for smooth pipeline operation
   Performance: speed => MIPS @ clk freq., DMIPS @ clk freq.,
                          CoreMark MIPS @ clk freq.
                power => mW @ (voltage, clk freq., technology)
DMIPS
   Dhrystone is a synthetic benchmark program for system programming. DMIPS
    therefore does not measure instructions per second; it gives an idea of how
    long a processor takes overall to execute the benchmark program.
   Industry has adopted the VAX 11/780 as the reference 1 MIPS
    machine. The VAX 11/780 achieves 1757 Dhrystones per second.
   The DMIPS rating of a given processor is calculated by measuring the
    number of Dhrystones executed per second and dividing that by 1757. So
    if a processor executes 140560 Dhrystones per second, its
    DMIPS rating is 140560/1757 = 80 DMIPS
   To compare two computing systems that run at different clock frequencies,
    DMIPS is normalized to clock frequency.
        e.g. 60 DMIPS @ 40 MHz = 1.5 DMIPS/MHz
   Newer benchmark for embedded processors => CoreMark MIPS
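The DMIPS arithmetic above can be sketched in a few lines of Python. This is only an illustration; the function names `dmips` and `dmips_per_mhz` are made up for this example and are not part of any benchmark tool.

```python
# Reference score of the VAX 11/780 (the 1 MIPS machine), from the slide
VAX_DHRYSTONES_PER_SECOND = 1757

def dmips(dhrystones_per_second):
    """DMIPS rating = measured Dhrystones per second / 1757."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dmips_rating, clock_mhz):
    """Normalize a DMIPS rating to the clock frequency (in MHz)."""
    return dmips_rating / clock_mhz

print(dmips(140560))          # 80.0
print(dmips_per_mhz(60, 40))  # 1.5
```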
   Two source registers (Rn and Rm) and one result register Rd
   Sign extend -> converts a signed 8/16-bit value to a 32-bit signed value
   Barrel shifter => preprocesses Rm before it enters the ALU; it performs
    shift and rotate operations
   MAC unit => for multiply and accumulate operations
   ARM Core under study is ARM7TDMI (32-bit RISC CPU, 3-stage pipeline)
   ARM state => Instructions are 32-bit wide and address is word aligned
   Thumb state => Instructions are 16-bit and address is half-word aligned
ARM Modes:
   Different modes of the ARM processor are defined for specific purposes
   User mode => most application software runs in this mode
   Non exception modes => User, System
   Exception modes => Supervisor, IRQ, FIQ, abort, undefined
   ‘supervisor’ mode => runs embedded operating system routines
   ‘User’ mode => runs Application programs
   IRQ & FIQ modes => handles hardware interrupts
   Abort mode => handles memory access violations
   Undefined mode => handles undefined instruction
        CPSR:
       32-bit register with condition flags, control bits, and status & extension fields
       Only privileged modes have full write access to CPSR
              N = 1 if MSB of the ALU result is 1
              Z = 1 if the ALU result is zero
              C = 1 if the ALU operation produces a carry (for subtraction, C is
                  cleared when a borrow occurs, i.e. the unsigned result would be negative)
              V = 1 if the ALU operation oVerflowed (meaningful for signed numbers only)
        Flags are updated only if the suffix ‘S’ is added to the instruction
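The flag rules above can be checked with a small Python model of ADDS and SUBS. It is a sketch, not a description of the ALU hardware: the names `adds_flags` and `subs_flags` are hypothetical, and subtraction is modelled as a + NOT(b) + 1, which is why C = 1 means "no borrow".

```python
MASK32 = 0xFFFFFFFF

def adds_flags(a, b):
    """Model ADDS on 32-bit operands; returns (result, N, Z, C, V)."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31                 # N: MSB of the result
    z = int(result == 0)             # Z: result is zero
    c = int(full > MASK32)           # C: carry out of bit 31
    # V: both operands have the same sign and the result's sign differs
    v = (((a ^ result) & (b ^ result)) >> 31) & 1
    return result, n, z, c, v

def subs_flags(a, b):
    """Model SUBS as a + NOT(b) + 1; C is cleared when a borrow occurs."""
    a &= MASK32
    b &= MASK32
    full = a + (~b & MASK32) + 1
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(full > MASK32)           # C = 1 means no borrow
    # V: operands have different signs and the result's sign differs from a
    v = (((a ^ b) & (a ^ result)) >> 31) & 1
    return result, n, z, c, v

print(adds_flags(0x7FFFFFFF, 1))  # signed overflow: N=1, V=1
print(subs_flags(3, 5))           # borrow: N=1, C=0
```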
Banked Registers:
   Total 37 registers = 30 general purpose + 6 status + 1 PC
   Different modes of operation use different sets of registers
   User and System modes use the same set of registers
   Shaded registers (banked registers) are hidden from User/System mode and
    available only in exception modes.
   R13 = stack pointer (SP)
   R14 = link register (LR) -> holds the return address of a subroutine when it is
    called with the BL instruction.
   Each exception mode has its own SP and LR
         BL <cc> subroutine_label (LR automatically stores return add.)
      The return can be done in two ways
           MOV PC, LR           or
           BX LR
ARM Family and Cores
ARM family  Core          Features                                                  ARM ISA version  Thumb version
ARM7        ARM7TDMI      3-stage pipeline, Thumb state                             ARMv4T           v1
ARM7        ARM720T       as ARM7TDMI, cache                                        ARMv4T
ARM7        ARM740T       as ARM7TDMI, cache                                        ARMv4T
ARM9        ARM920T       5-stage pipeline, Thumb, data and inst. cache, MMU        ARMv4T
ARM9        ARM922T       5-stage pipeline, Thumb, data and inst. cache, MMU        ARMv4T
ARM9        ARM946E       5-stage pipeline, Thumb, Enhanced DSP instructions,       ARMv5TE
                          caches, MPU
ARM9        ARM926EJ      5-stage pipeline, Thumb, Jazelle DBX, Enhanced DSP        ARMv5TEJ
                          instructions, caches, MMU
ARM11       ARM1156T2(F)  8-stage pipeline, SIMD, Thumb-2, VFP, Enhanced DSP        ARMv6T2          v2
                          instructions
Latest => ARM Cortex Series: Profile A, Profile R, Profile M
ARM Data Processing
   Syntax : <opcode> {<cc>} {S} Rd, Rn, op2
   ‘op2’ normally comes from the barrel shifter and can be the following:
        #immediate
        Rm
        Rm, <shift> #n     (shift/rotate by immediate)
        Rm, <shift> Rs     (shift/rotate by register)
   Rm and Rs should not be PC (r15) in the shift/rotate-by-register mode of ‘op2’
   Shift and rotate affect the N, Z and C flags
   The # value for shift and rotate is a 5-bit unsigned integer
ARM - The Barrel Shifter
   LSL : Logical Shift Left       CF <- Destination <- 0
         Multiplication by a power of 2
   LSR : Logical Shift Right      0 -> Destination -> CF
         Division by a power of 2
   ASR : Arithmetic Shift Right   Destination -> CF
         Division by a power of 2, preserving the sign bit
   ROR : Rotate Right             Destination -> CF
         Bit rotate with wrap around from LSB to MSB
   RRX : Rotate Right Extended    Destination -> CF
         Single-bit rotate with wrap around from CF to MSB
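The five shifter operations can be modelled in Python as below. This is a sketch under two assumptions: the value is an unsigned 32-bit integer, and the shift amount n is in the range 1..31 (the special cases for 0 and 32 are left out). The function names are made up for the example; each returns (result, carry_out).

```python
MASK32 = 0xFFFFFFFF

def lsl(value, n):
    """Logical shift left: the last bit shifted out of the MSB goes to CF."""
    carry = (value >> (32 - n)) & 1
    return (value << n) & MASK32, carry

def lsr(value, n):
    """Logical shift right: zeros enter at the MSB, last bit out goes to CF."""
    carry = (value >> (n - 1)) & 1
    return value >> n, carry

def asr(value, n):
    """Arithmetic shift right: the sign bit is replicated into the top bits."""
    carry = (value >> (n - 1)) & 1
    result = value >> n
    if value >> 31:
        result |= (MASK32 << (32 - n)) & MASK32
    return result, carry

def ror(value, n):
    """Rotate right: bits leaving the LSB re-enter at the MSB."""
    result = ((value >> n) | (value << (32 - n))) & MASK32
    return result, result >> 31

def rrx(value, carry_in):
    """Rotate right extended: one-bit rotate through CF (a 33-bit rotate)."""
    return (carry_in << 31) | (value >> 1), value & 1

print(hex(ror(0x000000FF, 4)[0]))  # 0xf000000f
```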
ARM Data Processing Instructions
   CMP, CMN, TST & TEQ always update flags (even if ‘S’ is not used as a
    suffix) and do not alter any register. They use only Rn and ‘op2’.
   MOV & MVN use only two operands, i.e. Rd and ‘op2’
ARM Immediate Operand
Immediate Operand (32-bit):
   ARM cannot encode every 32-bit constant as immediate data
   The instruction code has only 12 bits to specify a 32-bit constant:
    an 8-bit value (Imm) and a 4-bit rotate code (Rotate)
   Valid 32-bit constants are obtained by rotating the 8-bit constant right by an even
    number of positions, i.e. 0, 2, 4, ..., 30
   The amount of rotation is twice the value of the 4-bit “Rotate” field
   32-bit constants from a given 8-bit value and 4-bit Rotate code:
         if Imm=0x40, Rotate=0xD => rotate right by 26 => 32-bit constant = #4096
         if Imm=0xFF, Rotate=0x8 => rotate right by 16 => 32-bit constant = #0x00FF0000
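As an illustration of the rule above, the 12-bit field can be expanded to its 32-bit constant as follows. The helper name `decode_immediate` is hypothetical; the sketch simply rotates the 8-bit value right by twice the rotate code.

```python
def decode_immediate(imm8, rotate4):
    """Expand an 8-bit value and a 4-bit rotate code into a 32-bit constant."""
    rotation = 2 * (rotate4 & 0xF)        # rotation amount is twice the field
    value = imm8 & 0xFF
    # 32-bit rotate right by 'rotation' positions
    return ((value >> rotation) | (value << (32 - rotation))) & 0xFFFFFFFF

print(decode_immediate(0x40, 0xD))        # 4096
print(hex(decode_immediate(0xFF, 0x0)))   # 0xff
print(hex(decode_immediate(0xFF, 0xC)))   # 0xff00
```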
    Range of 32-bit constants for even
    rotations, e.g. #0, #2 & #30
   Valid 32-bit constants : 0x000000FF, 0x00000104, 0x0000FF00, 0xF000000F
   0x0FFFFFF0 cannot be encoded directly, but can be loaded with MVN as the
    complement of 0xF000000F
   Invalid 32-bit constants : 0x00000102, 0x0000FF04
   Examples: (i) MOV R1, #0x00000104 (ii) MVN R2, #0xFF000000 (iii) MVN R3, #0xFC000000
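Whether a constant is a valid immediate can be tested by trying all 16 even rotations. The sketch below uses a hypothetical helper `encode_immediate` that returns the first (Imm, Rotate) pair it finds, or None; an assembler may pick a different but equivalent encoding.

```python
def encode_immediate(constant):
    """Return (imm8, rotate4) if the constant is an 8-bit value rotated right
    by an even amount, otherwise None."""
    constant &= 0xFFFFFFFF
    for rotate4 in range(16):
        rotation = 2 * rotate4
        # Rotating left by 'rotation' undoes the rotate right of the encoding
        imm8 = ((constant << rotation) | (constant >> (32 - rotation))) & 0xFFFFFFFF
        if imm8 <= 0xFF:
            return imm8, rotate4
    return None

for c in (0x000000FF, 0x00000104, 0xF000000F, 0x00000102, 0x0000FF04):
    print(hex(c), encode_immediate(c))

# A constant that fails directly may still be loadable with MVN
# if its bitwise complement is encodable:
print(encode_immediate(0x0FFFFFF0), encode_immediate(~0x0FFFFFF0))
```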
Data processing:
   ADD R9, R5, R5, LSL #3 ; R9 = R5+(R5*8) = 9*R5
   RSB R9, R5, R5, LSR #3 ; R9 = (R5/8) - R5
   MOV R12, R4, ROR R3    ; R12 = R4 rotated right by value of R3
   CMP R7, R5             ; update flags after (R7-R5)
Conditional Execution:
   ARM instructions can be made to execute conditionally by suffixing
    them with the appropriate condition code (e.g. MOVEQ R0, R1)
   The condition checks the status of the appropriate flags
   If the condition is true, the instruction executes normally; otherwise it has no effect.
   Adv. => fewer branches, so better pipeline performance and higher code density,
    leading to higher instruction throughput
ARM   Conditional Execution
   Set the flags, and then use various condition codes (here r0 = a, r1 = x)
        if (a==0) x=0;
        if (a>0) x=1;
           CMP r0, #0
           MOVEQ r1, #0
           MOVGT r1, #1
   Conditional compare instructions
        if (a==4 or a==10) x=0;
           CMP r0, #4
           CMPNE r0, #10
           MOVEQ r1, #0
   Reduces the number of instructions (here r1 = a, r2 = b)
        while (a!=b) {
         if (a>b) a=a-b; else b=b-a; }
     With branches:
     loop:     CMP r1, r2
               BEQ finish
               BLT lessthan
               SUB r1, r1, r2
               B loop
     lessthan: SUB r2, r2, r1
               B loop
     finish
     With conditional execution:
     loop1:    CMP r1, r2
               SUBGT r1, r1, r2
               SUBLT r2, r2, r1
               BNE loop1
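For reference, the loop above computes the greatest common divisor by repeated subtraction. A Python rendering, with comments mapping each step to the conditional-execution version, might look like this. The function name is made up, and positive integer inputs are assumed (with zero or negative values the loop would not terminate).

```python
def gcd_by_subtraction(a, b):
    """Mirror of the conditional-execution loop (r1 = a, r2 = b)."""
    while a != b:      # CMP r1, r2 ... BNE loop1
        if a > b:
            a -= b     # SUBGT r1, r1, r2
        else:
            b -= a     # SUBLT r2, r2, r1
    return a

print(gcd_by_subtraction(12, 18))  # 6
```

SUBGT and SUBLT carry no ‘S’ suffix, so the flags set by CMP are still intact when BNE tests them.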
ARM Branch Instructions
   B <cc> label         : branch to label
    (MOV LR, PC can be used before the above inst. to store the return add.)
   BL <cc> subroutine_label (LR automatically stores return add.)
    The 24-bit offset field of the instruction code is shifted left by 2 to get a 26-bit
    effective offset (i.e. total range 2^26 bytes)
       ± 32 Mbyte range
       How to perform longer branches? (use BX Rm)
   BX Rm : branch with exchange
      If the LSB of Rm is 1, the processor switches to Thumb state; otherwise it
        remains in ARM state. PC = Rm & 0xFFFFFFFE
      Useful for interworking between ARM and Thumb state
   BLX Rm : similar to BX Rm but additionally stores the return address in
    LR (ARMv5T and later)
   BLX label : (ARMv5T and later)
      Branching in ± 32 Mbyte range with LR storing the return address
      Makes T=1 and enters Thumb state
   The T bit must not be changed by writing directly to the CPSR to change
    the state of the CPU
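The 24-bit offset arithmetic for B/BL can be sketched as below. One detail is not stated on the slide: in ARM state the PC reads as the address of the branch instruction plus 8, so the offset is taken relative to that value. The function names are hypothetical.

```python
def encode_branch_offset(branch_address, target_address):
    """Return the 24-bit offset field of a B/BL instruction (ARM state)."""
    # PC reads as the branch instruction's own address + 8
    byte_offset = target_address - (branch_address + 8)
    if byte_offset % 4:
        raise ValueError("target must be word aligned")
    word_offset = byte_offset >> 2           # field holds a word offset
    if not -(1 << 23) <= word_offset < (1 << 23):
        raise ValueError("target outside the +/-32 Mbyte range")
    return word_offset & 0xFFFFFF            # 24-bit two's complement

def decode_branch_target(branch_address, offset24):
    """Sign-extend the field, shift left by 2 and add to PC."""
    if offset24 & 0x800000:
        offset24 -= 1 << 24
    return branch_address + 8 + (offset24 << 2)

print(hex(encode_branch_offset(0x8000, 0x8000)))  # 0xfffffe (branch to itself)
print(hex(decode_branch_target(0x8000, 0x000002)))  # 0x8010
```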
ARM Multiply
   Normal (32-bit result) and long (64-bit result) multiplication
   Syntax:
        MUL{<cc>}{S} Rd, Rm, Rs ; Rd = Rm * Rs
        MLA{<cc>}{S} Rd, Rm, Rs, Rn ; Rd = (Rm * Rs) + Rn
        UMULL / SMULL {<cc>}{S} RdLo, RdHi, Rm, Rs
                                ; RdHi,RdLo := Rm*Rs
        UMLAL / SMLAL {<cc>}{S} RdLo, RdHi, Rm, Rs
                   ; RdHi,RdLo := (Rm*Rs) + RdHi,RdLo
   MUL and MLA truncate the result to the least significant 32 bits
   Rd must be a different register from Rm
   Rs and Rm can be swapped (useful when Rd would otherwise equal Rm)
   N and Z flags are affected (only if the suffix ‘S’ is used)
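The difference between the truncating and the long multiplies can be seen in a short Python model. The function names follow the mnemonics but are only illustrative; long results are returned as (RdLo, RdHi).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul(rm, rs):
    """MUL: only the least significant 32 bits of the product are kept."""
    return (rm * rs) & MASK32

def mla(rm, rs, rn):
    """MLA: multiply-accumulate, truncated to 32 bits."""
    return (rm * rs + rn) & MASK32

def umull(rm, rs):
    """UMULL: unsigned 64-bit product as (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32

def smull(rm, rs):
    """SMULL: signed 64-bit product as (RdLo, RdHi)."""
    product = (to_signed32(rm) * to_signed32(rs)) & MASK64
    return product & MASK32, product >> 32

print(mul(0x10000, 0x10000))    # 0 (upper bits lost)
print(umull(0x10000, 0x10000))  # (0, 1)
```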
ARM Load & Store Instructions
   Data movement between registers and memory
   Instruction format :  opcode<cc><size> Rd, <address>
   Opcodes:
      LDR   STR    ;32-bit word load & store
      LDRB  STRB   ;Byte load & store
      LDRH  STRH   ;16-bit halfword load & store
      LDRSB        ;Signed byte load
      LDRSH        ;Signed halfword load
       LDRB and LDRH copy 8-bit and 16-bit quantities from memory to the
         destination register and force the higher bits of the destination register to
         zero. For LDRSB and LDRSH the higher bits of the destination register
         are filled with the sign bit
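Zero extension versus sign extension can be illustrated as follows. The functions take the raw byte or halfword read from memory and return the 32-bit register value; the names mirror the mnemonics but are only a model.

```python
def ldrb(byte):
    """Unsigned byte load: upper 24 bits are forced to zero."""
    return byte & 0xFF

def ldrsb(byte):
    """Signed byte load: upper 24 bits are copies of bit 7."""
    byte &= 0xFF
    return byte | 0xFFFFFF00 if byte & 0x80 else byte

def ldrh(half):
    """Unsigned halfword load: upper 16 bits are forced to zero."""
    return half & 0xFFFF

def ldrsh(half):
    """Signed halfword load: upper 16 bits are copies of bit 15."""
    half &= 0xFFFF
    return half | 0xFFFF0000 if half & 0x8000 else half

print(hex(ldrb(0x80)), hex(ldrsb(0x80)))  # 0x80 0xffffff80
```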
   Address:
        Formed by a base register (Rn) and an offset
        Base register can be any general purpose register, including PC
        Offset can be (for 32-bit word and unsigned byte)
           immediate (# 12-bit value, added or subtracted)
           register or
           scaled register (Rm with shift/rotate by # immediate only)
        Offset for H, SH & SB : immediate value (# 8-bit) or register
        Choice of indexing : pre-index, pre-index with write back, and post-index addressing
           Post-index and pre-index with write back modify the base register value.
    Examples:
   LDR R8, [R3, #-3]         ; Load R8 from address R3-3 (pre-index);
                               R3 remains unchanged
   LDR R3, [R9], #4          ; Load R3 from address R9, then R9=R9+4
                               (post-index)
   STRB R7, [R6, #-1]!       ; Store byte from R7 at R6-1, then R6=R6-1
                               (pre-index with write back)
   LDREQB R0, [PC, -R2]      ; Load byte into R0 from PC-R2 if EQ condition is true
   LDR R11, [R3, R5, LSL #2] ; Load R11 from R3 + R5*4
    Note: By default, we assume ‘little endian’ format, where the least significant byte of a
    word is stored at the lowest address. In ‘big endian’ format the least significant byte of a
    word is stored at the highest address.
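The two byte orders can be seen by laying out one word in memory with Python's built-in `int.to_bytes`; the word value 0x12345678 is just an arbitrary example.

```python
word = 0x12345678

# Index 0 of each bytes object is the lowest memory address
little = word.to_bytes(4, "little")
big = word.to_bytes(4, "big")

print(little.hex())  # 78563412 -> least significant byte at lowest address
print(big.hex())     # 12345678 -> least significant byte at highest address
```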
ARM Pre & Post indexing
   Initial state: r1 (base register) = 0x200, r0 (source register for STR) = 0x5
   Pre-indexed: STR r0, [r1, #12]
        0x5 is stored at 0x200 + 12 = 0x20c; r1 remains 0x200
   Pre-indexed with write back: STR r0, [r1, #12]! => r1 = 0x20c after the instruction
   Post-indexed: STR r0, [r1], #12
        0x5 is stored at 0x200 (original base); r1 is then updated to 0x20c
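The three indexing modes can be compared with a toy model in which memory is a dictionary keyed by address and the registers are a dictionary keyed by name. All names here are hypothetical; the initial values (r0 = 0x5, r1 = 0x200, offset 12) are the ones used above.

```python
def str_pre_indexed(memory, regs, rd, rn, offset, writeback=False):
    """STR rd, [rn, #offset] or, with writeback, STR rd, [rn, #offset]!"""
    address = regs[rn] + offset
    memory[address] = regs[rd]
    if writeback:
        regs[rn] = address          # base register takes the new address

def str_post_indexed(memory, regs, rd, rn, offset):
    """STR rd, [rn], #offset"""
    memory[regs[rn]] = regs[rd]     # store at the original base address
    regs[rn] += offset              # then update the base register

def run(store, **kwargs):
    """Run one store from the same initial state; return (memory, final r1)."""
    memory, regs = {}, {"r0": 0x5, "r1": 0x200}
    store(memory, regs, "r0", "r1", 12, **kwargs)
    return memory, regs["r1"]

pre = run(str_pre_indexed)                     # 0x5 at 0x20c, r1 = 0x200
pre_wb = run(str_pre_indexed, writeback=True)  # 0x5 at 0x20c, r1 = 0x20c
post = run(str_post_indexed)                   # 0x5 at 0x200, r1 = 0x20c
print(pre, pre_wb, post)
```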
ARM Load/Store Multiple
   Multiple register load and store with a single instruction
   Syntax :
        LDM<cc><add_mode> Rn{!}, {registers}
        STM<cc><add_mode> Rn{!}, {registers}
         where add_mode : IA | IB | DA | DB
         Rn (base address) : must not be PC, and must not appear in the register list if !
        (write back) is specified
   Block memory copy: R9 -> points to start of source, R4 -> total no. of words to be
    copied, R10 -> points to start of destination
        We first transfer data in bunches (say 8
         words) using LDM/STM and register
         set R0-R7
        If the last bunch has fewer than 8 words, then
         the remaining words are transferred
         using LDR and STR (one word at a time)
         MOV R11, R4              ; copy word count from R4 into R11
 loop1:  CMP R11, #8              ; compare R11 with 8
         BLO skip                 ; skip if fewer than 8 words remain
         LDMIA R9!, {R0-R7}       ; transfer eight 32-bit words
         STMIA R10!, {R0-R7}
         SUBS R11, R11, #8
         B loop1
 skip:   CMP R11, #0              ; is R11 zero?
         BEQ halt                 ; end if R11 is zero
 loop2:  LDR R0, [R9], #4         ; word-by-word transfer (post-index)
         STR R0, [R10], #4
         SUBS R11, R11, #1
         BNE loop2
 halt:   END
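The intended behaviour of the copy routine (bunches of eight words, then single words) can be modelled in Python to check the loop structure. Here memory is a list of words and addresses are word indices, a simplification of the byte addresses held in R9 and R10; the function name is made up.

```python
def block_copy(memory, src, dst, count):
    """Copy 'count' words from index src to index dst, 8 at a time if possible."""
    remaining = count                      # MOV R11, R4
    while remaining >= 8:                  # CMP R11, #8 / BLO skip
        chunk = memory[src:src + 8]        # LDMIA R9!, {R0-R7}
        memory[dst:dst + 8] = chunk        # STMIA R10!, {R0-R7}
        src += 8
        dst += 8
        remaining -= 8
    while remaining != 0:                  # leftover words, one at a time
        memory[dst] = memory[src]          # LDR / STR with post-index
        src += 1
        dst += 1
        remaining -= 1
    return src, dst                        # final pointer values

memory = list(range(100, 119)) + [0] * 19  # 19 source words, then 19 free words
end_pointers = block_copy(memory, 0, 19, 19)
print(memory[19:38] == memory[0:19])       # True
```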
ARM Stack Operations
    SP replaces Rn; add_mode for stack operations is FD | FA | ED | EA
       F and E signify whether SP points to a location that is full or empty
       Stack is either ascending (growing towards high memory add.) or
        descending (growing towards low memory add.)
       One of the following pairs is used in an interrupt routine or handler
Example : Let R1=0x00000002, R4=0x00000003, SP=0x00000814
 STMFD sp!, {R1,R4}     ; full descending stack write
   After inst.: SP=0x0000080c, mem[0x810]=R4, mem[0x80c]=R1
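The full-descending push in the example, and the matching pop, can be reproduced with a small model. Memory is a dictionary keyed by byte address, and the register values are passed in ascending register-number order, since the lowest-numbered register goes to the lowest address. The function names follow the mnemonics but are only illustrative.

```python
def stmfd(memory, sp, values):
    """Push 'values' (ascending register order) on a full descending stack.
    Returns the new SP, which points at the last word pushed."""
    sp -= 4 * len(values)                  # stack grows towards low addresses
    for i, value in enumerate(values):
        memory[sp + 4 * i] = value         # lowest register at lowest address
    return sp

def ldmfd(memory, sp, count):
    """Pop 'count' words; returns (values, new SP)."""
    values = [memory[sp + 4 * i] for i in range(count)]
    return values, sp + 4 * count

memory = {}
sp = stmfd(memory, 0x814, [0x2, 0x3])      # STMFD sp!, {R1,R4}
print(hex(sp), memory)                     # 0x80c {2060: 2, 2064: 3}
```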
ARM Miscellaneous Instr.
   SWP <cc> Rd, Rm, [Rn]
      Swap a word between memory and a register
      tmp = mem32[Rn], mem32[Rn] = Rm and Rd = tmp (Rd and Rm may be the same register)
   SWPB <cc> Rd, Rm, [Rn] => Swap a byte
    The swap instruction is atomic: the read and the write of the memory location cannot be
    interrupted. Useful in implementing semaphores and mutual exclusion.
   Count leading zeros : CLZ <cc> Rd, Rm (ARMv5 and later)
CPSR instructions:
        MRS {<cc>} Rd, <CPSR | SPSR>            ; copy from PSR to Rd
        MSR {<cc>} <CPSR | SPSR>, Rm            ; copy from Rm to PSR
     Suffixes f, s, x and c select the flags, status, extension and control fields of CPSR/SPSR
        MSR cpsr_c, R0     ; update only control byte of CPSR
        MSR cpsr_fsc, R0   ; update flags, status and control bytes
                             of CPSR
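The effect of the field suffixes can be modelled as a byte mask applied to the PSR. The layout assumed here is f = bits 31-24, s = bits 23-16, x = bits 15-8, c = bits 7-0. The function `msr` is a hypothetical model and ignores privilege checks.

```python
FIELD_MASKS = {
    "f": 0xFF000000,   # flags field
    "s": 0x00FF0000,   # status field
    "x": 0x0000FF00,   # extension field
    "c": 0x000000FF,   # control field
}

def msr(psr, value, fields):
    """Return the PSR after MSR psr_<fields>, value: only the bytes named
    in 'fields' are replaced, the rest of the PSR is preserved."""
    mask = 0
    for field in fields:
        mask |= FIELD_MASKS[field]
    return (psr & ~mask & 0xFFFFFFFF) | (value & mask)

# MSR cpsr_c, R0 : only the control byte changes
print(hex(msr(0x600000D3, 0x00000010, "c")))  # 0x60000010
```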
Assembler Pseudo Instructions:
    LDR Rd, =constant
     If the constant can be constructed with MOV or MVN, then that
     instruction is generated. Otherwise the assembler
     generates a PC-relative LDR instruction that reads the constant
     from the literal pool.
     You must ensure that there is a literal pool within ±4KB range.
    LDR Rd, =label
     Stores the address of the label in the literal pool; upon execution of the
     instruction, Rd is loaded with that address