Parallelism in Arm Processor
John Goodacre and Andrew N. Sloss, ARM

Over the past 15 years, the ARM reduced-instruction-set computing (RISC) processor has evolved to offer a family of chips that range up to a full-blown multiprocessor. Embedded applications’ demand for increasing levels of performance and the added efficiency of key new technologies have driven the ARM architecture’s evolution.

Throughout this evolutionary path, the ARM team has used a full range of techniques known to computer architecture for exploiting parallelism. The performance and efficiency methods that ARM uses include variable execution time, subword parallelism, digital signal processor-like operations, thread-level parallelism and exception handling, and multiprocessing.

The developmental history of the ARM architecture shows how processors have used different types of parallelism over time. This development has culminated in the new ARM11 MPCore multiprocessor.

RISC FOR EMBEDDED APPLICATIONS

Early RISC designs such as MIPS focused purely on high performance. Architects achieved this with a relatively large register set, a reduced number of instruction classes, a load-store architecture, and a simple pipeline. All these now fairly common concepts can be found in many of today’s modern processors.

The ARM version of RISC differed in many ways, partly because the ARM processor became an embedded processor designed to be located within a system-on-chip device [1]. Although this kept the main design goal focused on performance, developers still gave priority to high code density, low power, and small die size.

To achieve this design, the ARM team changed the RISC rules to include variable-cycle execution for certain instructions, an inline barrel shifter to preprocess one of the input registers, conditional execution, a compressed 16-bit Thumb instruction set, and some enhanced DSP instructions.

  • Variable cycle execution. Because it is a load-store architecture, the ARM processor must first load data into one of the general-purpose registers before processing it. Given the single-cycle constraint the original RISC design imposed, loading and storing each register individually would be inefficient. Thus, the ARM ISA includes instructions that load and store multiple registers. These instructions take a variable number of cycles to execute, depending on the number of registers the processor is transferring. This is particularly useful for saving and restoring context in a procedure’s prologue and epilogue. It directly improves code density, reduces instruction fetches, and reduces overall power consumption.
  • Inline barrel shifter. To make each data processing instruction more flexible, either a shift or rotation can preprocess one of the source registers before the operation executes.
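As a small illustration of the barrel-shifter point, the C expressions below each combine a shift or rotation with another operation. On ARM, a compiler can often encode such an expression as a single data processing instruction with a shifted operand (for example, ADD r0, r0, r1, LSL #2), although the code actually generated depends on the compiler and its options. The function names are hypothetical and chosen only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* base + index * 4: a candidate for one ADD with an LSL #2 operand. */
static uint32_t word_offset(uint32_t base, uint32_t index)
{
    return base + (index << 2);
}

/* x * 5 written as x + (x << 2): a shift-and-add instead of a multiply. */
static uint32_t times_five(uint32_t x)
{
    return x + (x << 2);
}

/* Rotate right by n bits (1..31): the operation a ROR operand performs. */
static uint32_t rotate_right(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}
```

For instance, word_offset(0x1000, 3) addresses the fourth 32-bit word of an array starting at 0x1000.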
Formed in 1990 as a joint venture with Acorn Computers, Apple Computers, and VLSI Technology (which later became Philips Semiconductor), ARM started with only 12 employees and adopted a unique licensing business model for its processor designs. By licensing rather than manufacturing and selling its chip technology, ARM established a new business model that has redefined the way industry designs, produces, and sells microprocessors. Figure A shows how the ARM product family has evolved.

The first ARM-powered products were the Acorn Archimedes desktop computer and the Apple Newton PDA. The ARM processor was designed originally as a 32-bit [...] preter, which was at the time very popular in the [...]

[...] processor, which it incorporated into a product called the ARM Second Processor, which attached to the BBC Microcomputer through a parallel communication port called “the tube.” These devices were sold mainly as a research tool. ARM1 lacked the common multiply and divide instructions, which users had to synthesize using a combination of data processing instructions. The condition flags and the program counter were combined, limiting the effective addressing range to 26 bits. ARM ISA version 4 removed this limitation.

[Figure A. Evolution of the ARM product family. Surviving labels: ARM6 core, StrongARM processor, Physical IP.]
                                                                                                                                        July 2005                     43
Figure 2. SIMD versus non-SIMD power consumption. The lightweight ARM implementation of SIMD reduces gate count, hence significantly reducing die size, power, and complexity.
[Bar chart. Horizontal axis: improvement per MHz, 1.0 to 2.0. Series: ARM11 versus ARM9E; ARM11 versus ARM7. Codecs: MPEG4 ACC, MPEG4 ACC LTP, ACC Dec, MP3 Dec, MJPEG Dec, MJPEG Enc, JPEG Dec (1024x768 10:1), JPEG Enc (1024x768 10:1), H.264 Baseline Dec, H.264 Baseline Enc, H.263 Baseline Dec, H.263 Baseline Enc, MPEG4 SP Decode, MPEG4 SP Encode.]
ticularly useful for digital signal processing because nonsaturating operations would wrap around when the integer value overflowed, giving a negative result. A saturated QADD instruction returns a maximum value without wrapping around.

DATA-LEVEL PARALLELISM

Following the success of the enhanced DSP instructions introduced in the v5TE ISA, ARM introduced the ARMv6 ISA in 2001. In addition to improving both data- and thread-level parallelism, other goals for this design included enhanced mathematical operations, exception handling, and endianness handling.

An important factor influencing the ARMv6 ISA design involved increasing DSP-like functionality for overall video handling and 2D and 3D graphics. The design had to achieve this improved functionality while still maintaining very low power consumption. ARM identified the single-instruction, multiple-data (SIMD) architecture as the means for accomplishing this.

SIMD is a popular technique for providing data-level parallelism without compromising code density and power. A SIMD implementation requires relatively few instructions to perform complex calculations with minimum memory accesses.

Due to a careful balancing of computational efficiency and low power, ARM’s SIMD implementation involved splitting the standard 32-bit data path into four 8-bit or two 16-bit slices. This differs from many other implementations, which require additional specialized data paths for SIMD operations.

Figure 2 shows the per-MHz improvement for various codecs when using the ARMv6 SIMD instructions introduced in the ARM11 processor.

The lightweight ARM implementation of SIMD reduces gate count, hence significantly reducing die size, power, and complexity. Further, all the SIMD instructions execute conditionally.

To improve handling of video compression systems such as MPEG and H.263, ARM also introduced the sum of absolute differences (SAD) concept, another form of DLP instructions. Motion estimation compares two blocks of pixels, R(i,j) and C(i,j), by computing

  SAD = ∑ | R(i,j) – C(i,j) |

Smaller SAD values imply more similar blocks. Because motion estimation performs many SAD tests with different relative positions of the R and C blocks, video compression systems require very fast and energy-efficient implementations of the sum-of-absolute-differences operation.

The instructions USAD8 and USADA8 can compute the absolute difference between 8-bit values. This is particularly useful for motion-video-compression and motion-estimation algorithms.

THREAD-LEVEL PARALLELISM

We can view threads as processes, each with its own program counter and register set, while having the advantage of sharing a common memory space. For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity in handling multithreading on multiple processors. These requirements added inherent complexity in the interrupt handler, scheduler, and context switch.

One optimization extended the exception handling instructions to save precious cycles during the time-critical context switch. ARM achieved this by adding three new instructions to the instruction set, as Table 1 shows.
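Two operations from the DSP and data-level parallelism discussion, saturating addition (the behavior QADD provides) and the sum of absolute differences used in motion estimation, can be sketched in plain C. This is a reference model only: the function names sat_add32 and sad_block and the row-major block layout with a stride parameter are assumptions made for illustration, and the code does not use the QADD, USAD8, or USADA8 instructions themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating signed 32-bit add: clamps to the range limits instead of
 * wrapping around on overflow. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* SAD = sum over (i,j) of |R(i,j) - C(i,j)| for a width x height block
 * of 8-bit pixels. stride is the number of bytes between rows
 * (assumed to be the same for both blocks in this sketch). */
static uint32_t sad_block(const uint8_t *r, const uint8_t *c,
                          size_t width, size_t height, size_t stride)
{
    assert(r != NULL && c != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < height; i++) {
        for (size_t j = 0; j < width; j++) {
            int diff = (int)r[i * stride + j] - (int)c[i * stride + j];
            sum += (uint32_t)(diff < 0 ? -diff : diff);
        }
    }
    return sum;
}
```

A motion-estimation search would call sad_block once per candidate position of the C block and keep the position with the smallest result, which is why a fast SAD primitive matters.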
Programmers can use the change processor state (CPS) instruction to alter processor state by setting
44                   Computer
the current program status register to supervisor mode and disabling fast interrupt requests, as the code in Figure 3 shows. Whereas the ARMv4T ISA required four instructions to accomplish this task, the ARMv6 ISA requires only two.

Table 1. Exception handling instructions in the ARMv6 architecture.

  Instruction   Description              Action
  CPS           Change processor state   CPS<effect> <iflags>{,#<mode>}
                                         CPS #<mode>
                                         CPSID <flags>
                                         CPSIE <flags>
  RFE           Return from exception    RFE<addressing_mode> Rn!
  SRS           Save return state        SRS<addressing_mode>,#<mode>{!}

  ARMv4T ISA
          ; Copy CPSR
          MRS     r3, CPSR
          ; Mask mode and FIQ interrupt
          BIC     r3, r3, #MASK|FIQ
          ; Set supervisor mode and enable FIQ
          ORR     r3, r3, #SVC|nFIQ
          ; Update the CPSR
          MSR     CPSR_c, r3

  ARMv6 ISA
          ; Change processor state and modify select bits
          CPSIE   f, #SVC

Figure 3. ARMv6 ISA change processor state instruction compared with the ARMv4T architecture.

Programmers can use the save return state (SRS) instruction to modify the saved program status register in a specific mode. The updating of SPSR in a particular ARMv4T ISA mode involved many more instructions than in the ARMv6 ISA. This new instruction is useful for handling context switches or preparing to return from an exception handler.

MULTIPROCESSOR ATOMIC INSTRUCTIONS

Earlier ARM architectures implemented semaphores with the swap instruction, which held the external bus until completion. Obviously, this was unacceptable for thread-level parallelism because one processor could hold the entire bus until completion, disallowing all other processors. ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor in memory:

  • LDREX loads a value from memory and sets the exclusive monitor to watch that location, and
  • STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory and returns a value to indicate if the data was written.

Thus, the architecture can implement semaphores that do not lock the system bus that grants other processors or threads access to the memory system.

The ARM11 microarchitecture was the first hardware implementation of the ARMv6 ISA. It has an eight-stage pipeline with separate parallel pipelines for the load/store and multiply/accumulate operations. With the parallel load/store unit, the ARM1136J-S processor can continue executing without waiting for slower memory, a main gating factor for processor performance.

In addition, the ARM1136J-S processor has physically tagged caches to help with thread-level parallelism, as opposed to the virtually tagged caches of previous ARM processors. This considerably benefits context switches, especially when running large operating systems.

A virtually tagged cache must be flushed every time a context switch takes place because the cache contains old virtual-to-physical translations. In the ARM11, the memory management unit logic resides between the level 1 cache and the processor core. The reduction in cache flushing has the additional benefit of decreasing overall power consumption by reducing the external memory accesses that occur in a virtually tagged cache. The physically tagged cache increases overall performance by about 20 percent.

INSTRUCTION-LEVEL PARALLELISM

In ILP, the processor can execute multiple instructions from a single sequence of instructions concurrently. This form of parallelism has significant value in that it provides additional overall performance without affecting the software programming model.

Obviously, ILP puts more emphasis on the compiler that extracts it from the source code and schedules the instructions across the superscalar core. Although potentially simplifying otherwise overly complex hardware, an excessive drive to extract ILP and achieve high performance through increased MHz has increased hardware complexity and cost.

ARM has remained the processor at the edge of the network for several years. This area has always seen the most rapid advancements in technology, with continuous migration away from larger computer systems and toward smaller ones. For example, technology developed for mainframes a decade or two ago is found in desktop computers today. Likewise, technologies developed for the desktop five years ago have begun appearing in consumer and network products. For example, symmetric multiprocessing (SMP) is appearing today in both desktop and embedded computers.
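The load-exclusive/store-exclusive retry pattern described under Multiprocessor Atomic Instructions can be sketched in portable C. This is an illustration under stated assumptions, not ARM’s or any kernel’s code: the type sketch_sem and its functions are hypothetical names, and the sketch relies on the C11 <stdatomic.h> compare-exchange operation, which a compiler targeting ARMv6 or later may implement with an LDREX/STREX loop. The point it shows is that the update succeeds only if no other writer touched the location in between, and otherwise retries, without holding the bus.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counting semaphore, for illustration only. */
typedef struct {
    atomic_int count;   /* number of units currently available */
} sketch_sem;

static void sketch_sem_init(sketch_sem *s, int initial)
{
    assert(s != NULL && initial >= 0);
    atomic_init(&s->count, initial);
}

/* Try to take one unit without blocking; returns true on success. */
static bool sketch_sem_try_acquire(sketch_sem *s)
{
    int seen = atomic_load_explicit(&s->count, memory_order_relaxed);
    while (seen > 0) {
        /* The store takes effect only if count still equals 'seen'.
         * On failure, 'seen' is refreshed with the current value
         * and the loop retries. */
        if (atomic_compare_exchange_weak_explicit(
                &s->count, &seen, seen - 1,
                memory_order_acquire, memory_order_relaxed)) {
            return true;
        }
    }
    return false;   /* no units available */
}

/* Return one unit to the semaphore. */
static void sketch_sem_release(sketch_sem *s)
{
    atomic_fetch_add_explicit(&s->count, 1, memory_order_release);
}
```

A caller would initialize the semaphore with the number of available resources, call sketch_sem_try_acquire before using one, and call sketch_sem_release afterward.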
PERFORMANCE VERSUS POWER REQUIREMENTS

For several years, the embedded processor market inherited technology matured in desktop computing as consumers demanded similar functionality in their embedded devices. The continued demand for performance at low power has, however, driven slightly different requirements and led to the overall power budget being minimized by adding multiple processors and accelerators within an embedded design. Today, the demand for high levels of general-purpose computing drives using SMP as the application processor in both embedded and desktop systems.

In 2004, both the embedded and desktop markets hit the cost-performance-through-MHz wall. In response, developers began embracing potential solutions that require SMP processing to avoid the following pitfalls:

  • High MHz costs energy. Increasing a processor’s clock rate has a quadratic effect on power consumption. Not only does doubling the MHz double the dynamic power required to switch the logic, it also requires a higher operating voltage, which increases at the square of the frequency. Higher frequencies also add to design complexity, greatly increasing the amount of logic the processor requires.
  • Extracting ILP is complex and costly. Using hardware to extract ILP significantly raises the cost in silicon area and design complexity, further increasing power consumption.
  • Programming multiple independent processors is nonportable and inefficient. As developers use more processors, often with different architectures, the software complexity escalates, eliminating any portability between designs.

In mid-2004, PC manufacturers and chip makers made several announcements heralding the end of the MHz race in desktop processors and championing the use of multicore SMP processors in the server realm, primarily through the introduction of hyperthreading in the Intel Pentium processor. At that time, ARM announced its ARM11 MPCore multiprocessor core as a key solution to help address the demand for performance scalability.

Introduced alongside the ARM11 MPCore, a set of enhancements to the ARMv6 architecture provides further support for advanced SMP operating systems. In its move to support richer SMP-capable operating systems, ARM applied these enhancements, known as ARMv6K or AOS (for Advanced OS Support), across all ARMv6-architecture-based application processors to provide a firm foundation for embedded software.

The ARM11 multiprocessor also addressed the SMP system design’s two main bottlenecks:

  • interprocessor communication with the integration of the new ARM Generic Interrupt Controller (GIC), and
  • cache coherence with the integration of the Snoop Control Unit (SCU), an intelligent memory-communication system.

These logic blocks deliver an efficient, hardware-coherent single-core SMP processor that manufacturers can build cost-effectively.

PREPARATIONS FOR ARM MULTIPROCESSING

To fully realize the advantages of a multiprocessor hardware platform in general-purpose computing, ARM needed to provide a cache-coherent, symmetric software platform with a rich instruction set. ARM found that a few key enhancements to the current ARMv6 architecture could offer the significant performance boost it sought.

Enhanced atomic instructions

Researchers can use the ARMv6 load-and-store exclusives to implement both swap-based and compare-and-exchange-based semaphores to control access to critical data. In the traditional server computing world of SMP there has, however, been significant software investment in optimizing SMP code using lock-free synchronization. This work has been dominated by the x86 architecture and its atomic instructions that developers can use to compare and exchange data.

Many favored using the Intel cmpxchg8b instruction in these lock-free routines because it can exchange and compare 8 bytes of data atomically. Typically, this involved 4 bytes for payload and 4 bytes to distinguish between payload versions that could otherwise have the same value, the so-called A-B-A problem.

The ARM exclusives provide atomicity using the data address rather than the data value, so that the routines can atomically exchange data without experiencing the A-B-A problem. Exploiting this would, however, require rewriting much of the existing two-word exclusive code. Consequently, ARM added instructions for performing load-and-store exclusives using various payload sizes, including 8 bytes, thus ensuring the direct portability of existing multithreaded code.

Improved access to localized data

When an OS encounters the increasing number of threaded applications typical in SMP platforms, it must consider the performance overheads of associating thread-specific state with the currently executing thread. This can involve, for example, knowing which CPU a thread is executing on, accessing kernel structures specific to a thread, and enabling thread access to local storage. The AOS enhancements add registers that help with these SMP performance aspects.

CPU number. Using the standard ARM system coprocessor interface, software on a processor can execute a simple, nonmemory-accessing instruction to identify the processor on which it executes. Developers use this as an index into kernel structures.

Context registers. SMP operating systems handle two key demands from the kernel when providing access to thread-specific data. The ARMv6K architecture extensions define three additional system coprocessor registers that the OS can manage for whatever purpose it sees fit. Each register has a different access level:

  • user and privileged read/write accessible;
  • read-only in user, read/write privileged accessible; and
  • privileged only read/write accessible.

The exact use of these registers is OS-specific. In the Linux kernel and GNU toolchain, the ARM application binary interface has assigned these registers to enable thread local storage (TLS). A thread can use TLS to rapidly access thread-specific memory without losing any of the general-purpose registers.

To support TLS in C and C++, the new keyword __thread has been defined for use in defining and declaring a variable. Although not an official extension of the language, using the keyword has gained support from many compiler writers. Variables defined and declared this way would automatically be allocated locally to each thread:

  __thread int i;
  __thread struct state s;
  extern __thread char *p;

Supporting TLS is a key requirement for the new Native Posix Thread Library (NPTL) released as part of the Linux 2.6 kernel. This Posix thread library provides significant performance improvements over the old Linux pthread library.

Power-conscious spin-locks

Another SMP system cost involves the synchronization overhead required when processors must access shared data. At the lowest abstraction level in most SMP synchronization mechanisms, a spin-lock software technique uses a value in memory as a lock. If the memory location contains some predefined value, the OS considers the shared resource locked; otherwise it considers the resource unlocked. Before any software can access the shared resource, it must acquire the lock, and it must do so with an atomic operation. When the software finishes accessing the resource, it must release the lock.

In an SMP OS, processors often must wait while another processor holds a lock. The spin-lock received its name because it accomplishes this waiting by causing the processor to spin around a tight loop while continually attempting to acquire the lock. A later refinement to reduce bus contention added a back-off loop during which the processor does not attempt to access the lock. In either case, in a power-conscious embedded system, these unproductive cycles obviously waste energy.

The AOS extensions include a new instruction pair that lets a processor sleep while waiting for a lock to be freed and that, as a result, consumes less energy. The ARM11 multiprocessor implements these instructions in a way that provides next-cycle notification to the waiting processor when the lock is freed, without requiring a back-off loop. This results in both energy savings and a more efficient spin-lock implementation mechanism.

Figure 4 shows a sample implementation of the spin lock and unlock code used in the ARM Linux 2.6 kernel.

Weakly ordered memory consistency

The ARMv6 architecture defined various memory consistency models for the different definable memory regions. In the ARM11 multiprocessor, spin-lock code uses coherently cached memory to store the lock value. As a multiprocessor, the ARM11 MPCore is the first ARM processor to fully expose weakly ordered memory to the programmer. The multiprocessor uses three instructions to control weakly ordered memory’s side effects:
          static inline void _raw_spin_lock(spinlock_t *lock)
          {
              unsigned long tmp;

              __asm__ __volatile__(
              "1: ldrex   %0, [%1]\n"       /* exclusive read lock */
              "   teq     %0, #0\n"         /* check if free */
              "   wfene\n"                  /* if not, wait (saves power) */
              "   strexeq %0, %2, [%1]\n"   /* attempt to store to the lock */
              "   teqeq   %0, #0\n"         /* were we successful? */
              "   bne     1b"               /* no, try again */
              : "=&r" (tmp)
              : "r" (&lock->lock), "r" (1), "r" (0)
              : "cc", "memory"
              );
          }
Figure 4. Power-conscious spin-lock. This sample implementation shows the
spin-lock and unlock code used in the ARM Linux 2.6 kernel.

• wmb(). This Linux macro creates a write-memory barrier that the
  multiprocessor can use to place a marker in the sequencing of any writes
  around this barrier instruction. The spin-lock, for example, executes this
  instruction prior to unlocking to ensure that any writes to the payload
  data complete before the write to release the spin-lock, and hence before
  any other processor can acquire the lock. To ensure higher performance,
  the barrier does not necessarily stall the processor by flushing data.
  Rather it informs the load-store unit and lets execution continue in most
  situations.
• rmb(). Again from the Linux kernel, this macro places a read-memory
  barrier that prevents speculative reads of the payload from occurring
  before the read has acquired the lock. Although legal in the ARMv6
  architecture, this level of weakly ordered memory can make it difficult to
  ensure software correctness. Thus, the ARM11 multiprocessor implements
  only nonspeculative read-ahead. When the possibility exists that a read
  will not be required, as in the spin-lock case, where there is a branch
  instruction between the teqeq instruction and any payload read, the
  read-ahead does not take place. So, for the ARM11 MPCore multiprocessor,
  this macro can be defined as empty.
• DSB (Drain Store Buffer). The ARM architecture includes a buffer visible
  only to the processor. When running uniprocessor software, the processor
  allows subsequent reads to scan the data from this buffer. However, in a
  multiprocessor, this buffer becomes invisible to reads from other
  processors. The DSB drains this buffer into the L1 cache. In the ARM11
  multiprocessor, which has a coherent L1 cache between the processors, the
  flush only needs to proceed as far as the L1 memory system before another
  processor can read the data. In the spin-lock unlock code, the processors
  issue the DSB immediately prior to the SEV (Set Event) instruction so that
  any processor can read the correct value for the lock upon awakening.

ARM11 MPCORE MULTIPROCESSOR
   The ISA’s suitability is not the only factor affecting the
multiprocessor’s ability to actually deliver the scalability promises of SMP.
If they are poorly implemented, two aspects of an SMP design can
significantly limit peak performance and increase the energy costs associated
with providing SMP services:
48                    Computer
Figure 5. ARM11 MPCore. This multiprocessor integrates the new ARM GIC inside
the core to make the interrupt system’s access and effects much closer and
more efficient. (Diagram labels: configurable number of hardware interrupt
lines; private FIQ lines; interrupt distributor; per-CPU aliased peripherals,
shown as a timer, watchdog, CPU interface, and IRQ for each CPU; configurable
between 1 and 4 symmetric CPUs.)
• Cache coherence. Developers typically provide the single-image SMP OS with
  coherent caches so it can maintain performance by placing its data in
  cached memory. In the ARM11 multiprocessor, each CPU has its own
  instruction and L1 data cache. Existing coherency schemes often extend the
  system bus with additional signals to control and inspect other CPUs’
  caches. In an embedded system, the system bus often clocks slower than the
  CPU. Thus, besides placing a bottleneck between the processor and its
  cache, this scheme significantly increases the traffic and hence the
  energy the bus consumes. The ARM11 MPCore addresses these problems by
  implementing an intelligent SCU between each processor. Operating at CPU
  frequency, this configuration also provides a very rapid path for data to
  move directly between each CPU’s cache.
• Interprocessor communication. An SMP OS requires communication between
  CPUs, which sometimes is best accomplished without accessing memory. Also,
  the system must often regulate interprocessor communication using a
  spin-lock that synchronizes access to a protected resource. Other SMP OS
  communication between CPUs is best accomplished without accessing memory.
  Systems frequently must also synchronize asynchronously. One such
  mechanism uses the device’s interrupt system to cause activity on a remote
  processor. These software-initiated interprocessor interrupts (IPI)
  typically use an interrupt system designed to interface interrupts from
  I/O peripherals rather than another CPU.

   Figure 5 shows how the ARM11 MPCore integrates the new ARM GIC inside the
core to make the interrupt system’s access and effects closer and more
efficient. ARM designed the GIC to optimize the cost for the key forms of IPI
used in an SMP OS.

Interrupt subsystem
   A key example of IPI’s use in SMP involves a multithreaded application
that affects some state within the processor that is not hardware-coherent
with the other processors on which the application process has threads
running. This can occur when, for example, the application allocates some
virtual memory. To maintain consistency, the OS
also must apply these memory translations to all other processors. In this
example, the OS would typically apply the translation to its processor and
then use the low-contention private peripheral bus to write to an interrupt
control register in the GIC that causes an interrupt to all other processors.
The other processors could then use this interrupt’s ID to determine that
they need to update their memory translation tables.
   The GIC also uses various software-defined patterns to route interrupts to
specific processors through the interrupt distributor. In addition to their
dynamic load balancing of applications, SMP OSs often also dynamically
balance the interrupt handler load. The OS can use the per-processor aliased
control registers in the local private peripheral bus to rapidly change the
destination CPU for any particular interrupt.
   Another popular approach to interrupt distribution sends an interrupt to a
defined group of processors. The MPCore views the first processor to accept
the interrupt, typically the least loaded, as being best positioned to handle
the interrupt. This flexible approach makes the GIC technology suitable
across the range of ARM processors. This standardization, in turn, further
simplifies how software interacts with an interrupt controller.

Snoop control unit
   The MPCore’s SCU is an intelligent control block used primarily to control
data cache coherence between each attached processor. To limit the power
consumption and performance impact from snooping into and manipulating each
processor’s cache on each memory update, the SCU keeps a duplicate copy of
the physical address tag (pTag) for each cache line. Having this data
available locally lets the SCU limit cache manipulations to processors that
have cache lines in common.
   The processor maintains cache coherence with an optimized version of the
MESI (modified, exclusive, shared, invalid) protocol. With MESI, some common
operations, such as A = A + 1, cause many state transitions when performed on
shared data.
   To help improve performance and further reduce the power overhead
associated with maintaining coherence, the intelligence in the SCU monitors
the system for a migratory line. If one processor has a modified line, and
another processor reads then writes to it, the SCU assumes such a location
will experience this same operation in the future. As this operation starts
again, the SCU will automatically move the cache line directly to an invalid
state rather than expending energy moving it first into the shared state.
This optimization also causes the processor to transfer the cache line
directly to the other processor without intervening external memory
operations.
   This ability to move shared data directly between processors provides a
key feature that programmers can use to optimize their software. When
defining data structures that processors will share, programmers should
ensure appropriate alignment and packing of the structure so that line
migration can occur. Also, if the programmers use a queue to distribute work
items across processors, they should ensure that the queue is an appropriate
length and width so that when the worker processor picks up the work item, it
will transfer it again through this cache-to-cache transfer mechanism. To aid
with this level of optimization, the MPCore includes hardware instrumentation
for many operations within both the traditional L1 cache and the SCU.

   The ARMv6K ISA can be considered a key multiprocessor-aware instruction
set. With its foundation in low-power design, the architecture and its
implementation in the ARM11 MPCore can bring low power to high-performance
designs. These new designs show the potential to truly change how people
access technology. With more than 1.5 billion ARM processors being sold each
year, there is a huge range of markets in which ARM developers can use their
software code.

References
1. D. Seal, ARM Architecture Reference Manual, 2nd ed., Addison-Wesley
   Professional, 2000.
2. A. Sloss et al., ARM System Developer’s Guide, Morgan Kaufmann, 2004.

John Goodacre is a program manager at ARM with responsibility for
multiprocessing. His interests include all aspects of both hardware and
software in embedded multiprocessor designs. Goodacre received a BSc in
computer science from the University of York. Contact him at
john.goodacre@arm.com.

Andrew N. Sloss is a principal engineer at ARM. His research interests
include exception handling methods, embedded systems, and operating system
architecture. Sloss received a BSc in computer science from the University of
Hertfordshire. He is also a Chartered Engineer and a Fellow of the British
Computer Society. Contact him at andrew.sloss@arm.com.